DEV Community: Satoru Takeuchi

How Linux Works: Chapter 3 Process Scheduler (Part 3)

Satoru Takeuchi — Sat, 27 Jul 2024 03:45:02 +0000

Context Switch

Switching the process running on a logical CPU is called a 'context switch.' The following figure shows a context switch when Process 0 and Process 1 are present.

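Context switches occur mercilessly when a process's time slice expires, regardless of what code the process is running at that moment. Without understanding this, it is easy to assume that, when a program calls foo() and then bar(), bar() always runs immediately after foo().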

But in reality, there is no guarantee that bar() will be executed immediately after foo(). If the timeslice expires right after the execution of foo(), the execution of bar() could occur sometime later.

Understanding this can provide a different perspective when a certain operation takes longer than expected to complete. Rather than rashly concluding that there must be a problem with the operation itself, one could consider the possibility that a context switch occurred during the operation and another process was running.
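One way to observe this from inside a process is to look at the per-process counters the kernel keeps. The following is a minimal sketch, assuming Python's resource module on Linux; the function name and the loop count are illustrative and not part of the original programs. It reports how many involuntary context switches happened while a busy loop was running.

```python
import resource

def involuntary_switches_during(nloop):
    """Run a busy loop and return how many times this process was switched out involuntarily."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_nivcsw
    for _ in range(nloop):
        pass
    after = resource.getrusage(resource.RUSAGE_SELF).ru_nivcsw
    return after - before

if __name__ == "__main__":
    # A non-zero count means the loop was interrupted part-way through,
    # even though the code itself never asked to give up the CPU.
    print("involuntary context switches:", involuntary_switches_during(5_000_000))
```

The count depends on the machine and on what else is running, so treat it as an observation rather than a fixed result.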

Performance

Meeting the system's performance requirements is important. For instance, the following indicators are used for this purpose.

Turnaround Time: The time from when the system is asked to run a process until that process finishes.
Throughput: The number of processes that can be completed per unit of time.

Let's measure these values. Here, we will get the following performance information for the multiload.sh program.

Average Turnaround Time: The average of the 'real' values of all load.py processes, which just consume CPU time.
Throughput: The number of processes divided by the 'real' value of the multiload.sh program.
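The two definitions above amount to simple arithmetic. Here is a small sketch of that calculation in Python; the function name and the sample 'real' values are hypothetical and only illustrate the formulas.

```python
def summarize(turnaround_times, total_real):
    """Return (process count, average turnaround time [s], throughput [processes/s])."""
    nproc = len(turnaround_times)
    avg_tat = sum(turnaround_times) / nproc
    throughput = nproc / total_real
    return nproc, avg_tat, throughput

if __name__ == "__main__":
    # Hypothetical 'real' values: three load processes, then the whole run.
    nproc, avg_tat, throughput = summarize([2.9, 3.0, 3.1], total_real=3.1)
    print("{}\t{:.3f}\t{:.3f}".format(nproc, avg_tat, throughput))
```

The printed line uses the same three-column layout as one entry of cpuperf.data.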

To obtain this information, we use the cpuperf.sh program.

cpuperf.sh

#!/bin/bash

usage() {
    exec >&2
    echo "Usage: $0 [-m] <maximum process count>
1. Save performance information to a file named 'cpuperf.data'
  * The number of entries is <maximum process count>
  * The format of each line is '<process count> <average turnaround time [seconds]> <throughput [processes/second]>'
2. Create a graph of the average turnaround time based on the performance information and save it as 'avg-tat.jpg'
3. Similarly, create a graph of the throughput and save it as 'throughput.jpg'

The -m option is passed directly to the multiload.sh program."
    exit 1
}

measure() {
    local nproc=$1
    local opt=$2
    bash -c "time ./multiload.sh $opt $nproc" 2>&1 | grep real | sed -n -e 's/^.*0m\([.0-9]*\)s$/\1/p' | awk -v nproc=$nproc '
BEGIN{
    sum_tat=0
}
(NR<=nproc){
    sum_tat+=$1
}
(NR==nproc+1) {
    total_real=$1
}
END{
    printf("%d\t%.3f\t%.3f\n", nproc, sum_tat/nproc, nproc/total_real)    
}'
}

while getopts "m" OPT ; do
    case $OPT in
        m)
            MEASURE_OPT="-m"
            ;;
        \?)
            usage
            ;;
    esac
done

shift $((OPTIND - 1))

if [ $# -lt 1 ]; then
    usage
fi

rm -f cpuperf.data
MAX_NPROC=$1
for ((i=1;i<=MAX_NPROC;i++)) ; do
    measure $i $MEASURE_OPT  >>cpuperf.data
done

./plot-perf.py $MAX_NPROC

plot-perf.py

#!/usr/bin/python3

import sys
import plot_sched

def usage():
    print("""usage: {} <max_nproc>
    * create graphs from cpuperf.data
    * "avg-tat.jpg": average turnaround time
    * "throughput.jpg": throughput""".format(progname), file=sys.stderr)
    sys.exit(1)

progname = sys.argv[0]

if len(sys.argv) < 2:
    usage()

max_nproc = int(sys.argv[1])
plot_sched.plot_avg_tat(max_nproc)
plot_sched.plot_throughput(max_nproc)

plot_sched.py

#!/usr/bin/python3

import numpy as np
from PIL import Image
import matplotlib
import os

matplotlib.use('Agg')

import matplotlib.pyplot as plt

plt.rcParams['font.family'] = "sans-serif"
plt.rcParams['font.sans-serif'] = "TakaoPGothic"

def plot_avg_tat(max_nproc):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    x, y, _ = np.loadtxt("cpuperf.data", unpack=True)
    ax.scatter(x,y,s=1)
    ax.set_xlim([0, max_nproc+1])
    ax.set_xlabel("# of processes ")
    ax.set_ylim(0)
    ax.set_ylabel("average turnaround time[s]")

    # Save as png and convert this to jpg to bypass the following bug.
    # https://bugs.launchpad.net/ubuntu/+source/matplotlib/+bug/1897283?comments=all
    pngfilename = "avg-tat.png"
    jpgfilename = "avg-tat.jpg"
    fig.savefig(pngfilename)
    Image.open(pngfilename).convert("RGB").save(jpgfilename)
    os.remove(pngfilename)

def plot_throughput(max_nproc):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    x, _, y = np.loadtxt("cpuperf.data", unpack=True)
    ax.scatter(x,y,s=1)
    ax.set_xlim([0, max_nproc+1])
    ax.set_xlabel("# of processes")
    ax.set_ylim(0)
    ax.set_ylabel("Throughput[process/s]")

    # Save as png and convert this to jpg to bypass the following bug.
    # https://bugs.launchpad.net/ubuntu/+source/matplotlib/+bug/1897283?comments=all
    pngfilename = "throughput.png"
    jpgfilename = "throughput.jpg"
    fig.savefig(pngfilename)
    Image.open(pngfilename).convert("RGB").save(jpgfilename)
    os.remove(pngfilename)

multiload.sh

#!/bin/bash

MULTICPU=0
PROGNAME=$0
SCRIPT_DIR=$(cd $(dirname $0) && pwd)

usage() {
    exec >&2
    echo "usage: $PROGNAME [-m] <# of processes>
    Run load-testing processes, which just consume CPU time, and show the elapsed time after waiting for all of them to finish.
    These processes run on one CPU by default.

options:
    -m: allow load-testing processes to run on arbitrary CPUs"
    exit 1
}

while getopts "m" OPT ; do
    case $OPT in
        m)
            MULTICPU=1
            ;;
        \?)
            usage
            ;;
    esac
done

shift $((OPTIND - 1))

if [ $# -lt 1 ] ; then
    usage
fi

CONCURRENCY=$1

if [ $MULTICPU -eq 0 ] ; then
    taskset -p -c 0 $$ >/dev/null
fi

for ((i=0;i<CONCURRENCY;i++)) do
    time "${SCRIPT_DIR}/load.py" &
done

for ((i=0;i<CONCURRENCY;i++)) do
    wait
done

load.py

#!/usr/bin/python3

NLOOP=100000000

for _ in range(NLOOP):
    pass

First, we'll show the results of restricting the load processes to one logical CPU and setting the maximum number of processes to 4, i.e., executing ./cpuperf.sh 4.

We can see that increasing the number of processes beyond the number of logical CPUs only lengthens the average turnaround time and does not improve the throughput.

If you continue to increase the number of processes, the context switches caused by the scheduler will gradually increase the average turnaround time and decrease the throughput. From a performance perspective, simply increasing the number of processes does not help once the CPU resources are fully utilized.

Let's dig a little deeper into the turnaround time. Suppose you have a web application that does the following processing on your system:

Receives requests from users over the Internet.
Generates HTML files according to the request.
Sends the results back to the user over the Internet.

If new requests of this kind keep arriving while the load on the logical CPU is high, the average turnaround time will get longer and longer. This directly affects the response time of the web application from the user's perspective and degrades the user experience. A system that prioritizes response performance therefore needs to keep each machine's CPU usage lower than a system that prioritizes throughput does.

Next, we will collect data for the case where all logical CPUs can be used. The number of logical CPUs can be obtained by the grep -c processor /proc/cpuinfo command.

grep -c processor /proc/cpuinfo
8

In the author's environment, there are 8 logical CPUs because the CPU has 4 cores with 2 threads each.
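The same count can be obtained from a program. The sketch below is an illustration rather than part of the original scripts: count_logical_cpus is a hypothetical helper that counts the lines beginning with 'processor', which is close to what the grep command does, and os.cpu_count() is shown for comparison.

```python
import os

def count_logical_cpus(cpuinfo_path="/proc/cpuinfo"):
    """Count the lines that start with 'processor', similar to grep -c processor."""
    with open(cpuinfo_path) as f:
        return sum(1 for line in f if line.startswith("processor"))

if __name__ == "__main__":
    print("os.cpu_count():", os.cpu_count())
    if os.path.exists("/proc/cpuinfo"):
        print("processor lines in /proc/cpuinfo:", count_logical_cpus())
```

Note that the layout of /proc/cpuinfo differs between architectures, so the two numbers are not guaranteed to match everywhere.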

In this experiment, if the system has SMT enabled, we will disable SMT, as shown below. The reason for disabling it will be discussed in another article.

cat /sys/devices/system/cpu/smt/control # If the output is 'on', SMT is enabled. If the file does not exist, the CPU does not support SMT in the first place.
on

echo off >/sys/devices/system/cpu/smt/control
cat /sys/devices/system/cpu/smt/control
off

grep -c processor /proc/cpuinfo
4
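If you want a script to check the SMT state before running the experiment, the file shown above can be read like any other text file. This is a minimal sketch; read_smt_control is a hypothetical helper name, and it only reads the state without changing it.

```python
def read_smt_control(path="/sys/devices/system/cpu/smt/control"):
    """Return the SMT control state (e.g. 'on' or 'off'), or None if the file is missing."""
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None

if __name__ == "__main__":
    state = read_smt_control()
    if state is None:
        print("no SMT control file found")
    else:
        print("SMT control:", state)
```

Remembering the value returned here makes it easy to restore the original setting after the measurement.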

In this state, we set the maximum number of processes to 8 and collect performance information, namely by executing ./cpuperf.sh -m 8. The results are shown in the following figure.

You can see that the average turnaround time gradually increases until the number of processes equals the number of logical CPUs (4 in this case). However, after that, it increases significantly.

Next, the throughput improves until the number of processes equals the number of logical CPUs, but then it plateaus. From this, we can say the following:

Even if a machine has many logical CPUs, its throughput will only improve if a sufficient number of processes run on it.
Blindly increasing the number of processes will not increase the throughput.
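These two observations can be reproduced with a very rough model. The sketch below assumes identical CPU-bound processes, perfectly fair sharing, and no context switch overhead, so it is an idealization rather than a description of the measured data; the function name and the 2-second workload are hypothetical.

```python
def ideal_metrics(nproc, ncpu, cpu_seconds):
    """Idealized model: identical CPU-bound processes, fair sharing, no switch overhead."""
    # At most ncpu processes make progress at once, so the whole batch needs
    # nproc * cpu_seconds of CPU time spread over min(nproc, ncpu) CPUs.
    busy_cpus = min(nproc, ncpu)
    elapsed = nproc * cpu_seconds / busy_cpus
    avg_turnaround = elapsed        # all processes finish at about the same time
    throughput = nproc / elapsed    # equals busy_cpus / cpu_seconds
    return avg_turnaround, throughput

if __name__ == "__main__":
    for nproc in range(1, 9):
        tat, tp = ideal_metrics(nproc, ncpu=4, cpu_seconds=2.0)
        print("{}\t{:.3f}\t{:.3f}".format(nproc, tat, tp))
```

In this model the throughput grows until the number of processes reaches the number of logical CPUs and stays flat afterwards, while the turnaround time only starts to grow beyond that point. Real measurements add scheduling overhead on top of this.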

If SMT was enabled before the experiment, it should be re-enabled as follows:

# echo on >/sys/devices/system/cpu/smt/control

The importance of parallel execution of programs is increasing year by year. This is because the approach to improving CPU performance has changed. Once upon a time, with every new generation of CPUs, one could expect a dramatic improvement in the performance per logical CPU, referred to as single-threaded performance. In this case, the processing speed kept increasing without having to modify the program at all. However, the situation has changed over the past decade or so. It has become difficult to improve single-threaded performance due to various circumstances. As a result, when a CPU generation changes, single-threaded performance no longer improves as much as it used to. Instead, the trend has been to increase the total performance of the CPU by increasing the number of CPU cores.

The kernel has also been improving scalability when increasing the number of cores, which is in line with these changing times. As times change, common sense changes, and software changes to adapt to this new common sense.


NOTE

This article is based on my book written in Japanese. Please contact me via satoru.takeuchi@gmail.com if you're interested in publishing this book's English version.

How Linux Works: Chapter 3 Process Scheduler (Part 2)

Satoru Takeuchi — Wed, 04 Oct 2023 20:51:12 +0000

Time Slice

In the previous section, we found that only one process can run on a logical CPU at a time. However, that experiment did not show how CPU resources are distributed among processes. Therefore, in this section, we will confirm through experiments that the scheduler lets runnable processes use the CPU in units of time slices.

We will use a program called sched.py for the experiment.

#!/usr/bin/python3

import sys
import time
import os
import plot_sched

def usage():
    print("""Usage: sched.py <number of processes>
        * After starting <number of processes> load processing processes on logical CPU0, which consume CPU resources for about 100 milliseconds simultaneously, wait for all processes to end.
        * Write out a graph showing the execution results to a file named "sched-<number of processes>.jpg".
        * The x-axis of the graph represents elapsed time [milliseconds] from the start of the load processing process, and the y-axis represents progress [%].""", file=sys.stderr)
    sys.exit(1)

# Find the appropriate load for this experimentation.
# This estimation is expected to take several seconds.
# Please decrease/increase NLOOP_FOR_ESTIMATION if this estimation takes too long/too short a time.
NLOOP_FOR_ESTIMATION=100000000
nloop_per_msec = None
progname = sys.argv[0]

def estimate_loops_per_msec():
    before = time.perf_counter()
    for _ in  range(NLOOP_FOR_ESTIMATION):
        pass
    after = time.perf_counter()
    return int(NLOOP_FOR_ESTIMATION/(after-before)/1000)

def child_fn(n):
    progress = 100*[None]
    for i in range(100):
        for j in range(nloop_per_msec):
            pass
        progress[i] = time.perf_counter()
    f = open("{}.data".format(n),"w")
    for i in range(100):
        f.write("{}\t{}\n".format((progress[i]-start)*1000,i))
    f.close()
    exit(0)

if len(sys.argv) < 2:
    usage()

concurrency = int(sys.argv[1])

if concurrency < 1:
    print("<the number of processes> should be >= 1: {}".format(concurrency))
    usage()

# Force to run on logical CPU0
os.sched_setaffinity(0, {0})

nloop_per_msec = estimate_loops_per_msec()

start = time.perf_counter()

for i in range(concurrency):
    pid = os.fork()
    if (pid < 0):
        exit(1)
    elif pid == 0:
        child_fn(i)

for i in range(concurrency):
    os.wait()

plot_sched.plot_sched(concurrency)

This program continuously runs one or more load-processing processes that use CPU time and collects the following statistical information:

At a certain point, which process is running on the logical CPU
How much progress each one has made

By analyzing this data, we will check whether the explanation of the scheduler given at the beginning is correct. The specification of the experimental program sched.py is as follows.

Usage: sched.py <number of processes>
        * After starting <number of processes> load processing processes on logical CPU0, which consume CPU resources for about 100 milliseconds simultaneously, wait for all processes to end.
        * Write out a graph showing the execution results to a file named "sched-<number of processes>.jpg".
        * The x-axis of the graph represents elapsed time [milliseconds] from the start of the load processing process, and the y-axis represents progress [%].

The file plot_sched.py is also used to draw the graphs, so if you run sched.py, please place plot_sched.py in the same directory.

#!/usr/bin/python3

import numpy as np
from PIL import Image
import matplotlib
import os

matplotlib.use('Agg')

import matplotlib.pyplot as plt

plt.rcParams['font.family'] = "sans-serif"
plt.rcParams['font.sans-serif'] = "TakaoPGothic"

def plot_sched(concurrency):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    for i in range(concurrency):
        x, y = np.loadtxt("{}.data".format(i), unpack=True)
        ax.scatter(x,y,s=1)
    ax.set_title("Visualize timeslice(concurrency={})".format(concurrency))
    ax.set_xlabel("elapsed time[ms]")
    ax.set_xlim(0)
    ax.set_ylabel("progress[%]")
    ax.set_ylim([0,100])
    legend = []
    for i in range(concurrency):
        legend.append("load"+str(i))
    ax.legend(legend)

    # Save the image as png temporarily and convert it to jpg to avoid a matplotlib bug that exists in Ubuntu 20.04.
    # https://bugs.launchpad.net/ubuntu/+source/matplotlib/+bug/1897283?comments=all
    pngfilename = "sched-{}.png".format(concurrency)
    jpgfilename = "sched-{}.jpg".format(concurrency)
    fig.savefig(pngfilename)
    Image.open(pngfilename).convert("RGB").save(jpgfilename)
    os.remove(pngfilename)

def plot_avg_tat(max_nproc):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    x, y, _ = np.loadtxt("cpuperf.data", unpack=True)
    ax.scatter(x,y,s=1)
    ax.set_xlim([0, max_nproc+1])
    ax.set_xlabel("concurrency")
    ax.set_ylim(0)
    ax.set_ylabel("average turnaround time[s]")

    # Save the image as png temporarily and convert it to jpg to avoid a matplotlib bug that exists in Ubuntu 20.04.
    # https://bugs.launchpad.net/ubuntu/+source/matplotlib/+bug/1897283?comments=all
    pngfilename = "avg-tat.png"
    jpgfilename = "avg-tat.jpg"
    fig.savefig(pngfilename)
    Image.open(pngfilename).convert("RGB").save(jpgfilename)
    os.remove(pngfilename)

def plot_throughput(max_nproc):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    x, _, y = np.loadtxt("cpuperf.data", unpack=True)
    ax.scatter(x,y,s=1)
    ax.set_xlim([0, max_nproc+1])
    ax.set_xlabel("concurrency")
    ax.set_ylim(0)
    ax.set_ylabel("throughput[process/s]")

    # Save the image as png temporarily and convert it to jpg to avoid a matplotlib bug that exists in Ubuntu 20.04.
    # https://bugs.launchpad.net/ubuntu/+source/matplotlib/+bug/1897283?comments=all
    pngfilename = "throughput.png"
    jpgfilename = "throughput.jpg"
    fig.savefig(pngfilename)
    Image.open(pngfilename).convert("RGB").save(jpgfilename)
    os.remove(pngfilename)

This program will be executed with parallelism of 1, 2, and 3, respectively.

for i in 1 2 3 ; do ./sched.py $i ; done

The results are shown in the following figures.

These graphs show that when multiple processes are running on a single logical CPU, each process is using the CPU alternately in time slices of a few milliseconds.

Column: How Time Slices Work

Looking closely at the graphs above, you can see that each process's time slice is shorter when <number of processes> is 3 than when it is 2. In fact, the Linux scheduler is designed so that each runnable process obtains CPU time once per period, called the latency target, whose length is given by the sysctl parameter kernel.sched_latency_ns (in nanoseconds).

In the author's environment, this parameter is set to the following value:

$ sysctl kernel.sched_latency_ns
kernel.sched_latency_ns = 24000000  # 24000000/1000000 = 24 milliseconds

The time slice for each process is kernel.sched_latency_ns / <number of processes> [nanoseconds].
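As a worked example of this division, the following sketch computes the time slice for 1 to 3 runnable processes using the 24-millisecond value from the author's environment. The function name is hypothetical, and the calculation is the simplified one described in this column, not the kernel's actual code.

```python
def time_slice_ns(latency_target_ns, nproc):
    """Simplified time slice: the latency target split evenly among runnable processes."""
    return latency_target_ns // nproc

if __name__ == "__main__":
    latency_target_ns = 24_000_000  # value shown by sysctl in the author's environment
    for nproc in (1, 2, 3):
        print(nproc, time_slice_ns(latency_target_ns, nproc) / 1_000_000, "ms")
```

With this setting, adding a third process shortens each time slice from 12 to 8 milliseconds.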

The relationship between the latency target and the time slice when there are 1 to 3 executable processes on a logical CPU is shown in the following figure.

In the scheduler of the Linux kernel before 2.6.23, the time slice was a fixed value (100 milliseconds), but this posed a problem because CPU time did not rotate efficiently between processes as the number of processes increased. To address this issue, the current scheduler adjusts the time slice according to the number of processes.

The calculation of the latency target and time slice values becomes slightly more complicated as the number of processes increases or on multicore CPUs, and it varies depending on factors such as:

The number of logical CPUs in the system
Whether the number of processes running or waiting on a logical CPU exceeds a certain value
The nice value that represents the priority of the process

In this section, we will discuss the effect of the nice value. The nice value is a setting that defines the execution priority of a process within a range from "-20" to "19" (the default is "0"). -20 is the highest priority and 19 is the lowest. Any user can lower the priority, but only users with root privileges can raise it.

The nice value can be changed using the nice command, renice command, nice() system call, setpriority() system call, and so on. The scheduler gives more time slices to processes with a low nice value (i.e., a high priority).
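In Python, the nice() system call is available as os.nice(), and the current value can be read with os.getpriority(). The sketch below is an illustration with a hypothetical helper name: it lowers the priority of a forked child only, so the parent's own nice value stays unchanged.

```python
import os

def child_niceness_after(increment):
    """Fork a child, raise its nice value by `increment`, and return the new value."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: os.nice() adds the increment and returns the resulting nice value.
        os.close(read_fd)
        new_value = os.nice(increment)
        os.write(write_fd, str(new_value).encode())
        os._exit(0)
    # Parent: its own nice value is not affected by the child's change.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        value = int(reader.read())
    os.waitpid(pid, 0)
    return value

if __name__ == "__main__":
    print("parent nice value:", os.getpriority(os.PRIO_PROCESS, 0))
    print("child nice value after os.nice(5):", child_niceness_after(5))
```

Passing a negative increment would raise the priority instead, which normally fails for users without root privileges.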

Let's try running the sched_nice.py program, which has the following specification:

Usage: sched_nice.py <nice value>
        * After launching two load processing processes that consume about 100 milliseconds of CPU resources on logical CPU0, wait for both processes to finish.
        * The nice values of load processes 0 and 1 are set to 0 (default) and <nice value>, respectively.
        * Writes a graph showing the execution result to a file named "sched-2.jpg".
        * The x-axis of the graph represents the elapsed time from the start of the process [milliseconds], and the y-axis represents the progress [%].

The source code is here.

#!/usr/bin/python3

import sys
import time
import os
import plot_sched

def usage():
    print("""Usage: sched_nice.py <nice value>
        * After launching two load processing processes that consume about 100 milliseconds of CPU resources on logical CPU0, wait for both processes to finish.
        * The nice values of load processes 0 and 1 are set to 0 (default) and <nice value>, respectively.
        * Writes a graph showing the execution result to a file named "sched-2.jpg".
        * The x-axis of the graph represents the elapsed time from the start of the process [milliseconds], and the y-axis represents the progress [%].""", file=sys.stderr)
    sys.exit(1)

# Find the appropriate load for this experimentation.
# This estimation is expected to take several seconds.
# Please decrease/increase NLOOP_FOR_ESTIMATION if this estimation takes too long/too short a time.
NLOOP_FOR_ESTIMATION=100000000
nloop_per_msec = None
progname = sys.argv[0]

def estimate_loops_per_msec():
    before = time.perf_counter()
    for _ in  range(NLOOP_FOR_ESTIMATION):
        pass
    after = time.perf_counter()
    return int(NLOOP_FOR_ESTIMATION/(after-before)/1000)

def child_fn(n):
    progress = 100*[None]
    for i in range(100):
        for _ in range(nloop_per_msec):
            pass
        progress[i] = time.perf_counter()
    f = open("{}.data".format(n),"w")
    for i in range(100):
        f.write("{}\t{}\n".format((progress[i]-start)*1000,i))
    f.close()
    exit(0)

if len(sys.argv) < 2:
    usage()

nice = int(sys.argv[1])
concurrency = 2

if concurrency < 1:
    print("<the number of processes> should be >= 1: {}".format(concurrency))
    usage()

# Force running on logical CPU0
os.sched_setaffinity(0, {0})

nloop_per_msec = estimate_loops_per_msec()

start = time.perf_counter()

for i in range(concurrency):
    pid = os.fork()
    if (pid < 0):
        exit(1)
    elif pid == 0:
        if i == concurrency - 1:
            os.nice(nice)
        child_fn(i)

for i in range(concurrency):
    os.wait()

plot_sched.plot_sched(concurrency)

Here, let's specify 5 for <nice value>.

$ ./sched_nice.py 5

The results are shown in the following graph.

As expected, you can see that load process 0 has more time slices than load process 1.

By the way, the "%nice" field in the output of sar shows the proportion of time that processes with a priority lower than the default value of 0 are executing in user mode (%user is for nice value 0). Let's run the inf-loop program we used in Chapter XX with a lowered priority (we'll set it to 5 here), and check the CPU usage with sar at that time.

$ nice -n 5 taskset -c 0 ./inf-loop &
[1] 168376
$ sar -P 0 1 1
Linux 5.4.0-74-generic (coffee)         2021-12-04      _x86_64_        (8 CPU)

05:57:58        CPU     %user     %nice   %system   %iowait    %steal     %idle
05:57:59          0      0.00    100.00      0.00      0.00      0.00      0.00
Average:          0      0.00    100.00      0.00      0.00      0.00      0.00
$ kill 168376

You can see that it's not %user, but %nice that has reached 100.
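The kernel keeps these two kinds of time in separate counters: in /proc/stat, the first two numbers on each cpu line are the time spent in user mode at normal priority and at lowered priority (nice), respectively. The following sketch shows how such a line can be split; the helper name and the sample line are hypothetical, and because the real counters are cumulative since boot, a tool has to take the difference between two samples to get per-interval percentages.

```python
def user_and_nice_share(stat_line):
    """Return (%user, %nice) computed from one 'cpuN ...' line of /proc/stat."""
    fields = stat_line.split()
    # Columns after the CPU label: user, nice, system, idle, iowait, irq, softirq, steal, ...
    ticks = [int(v) for v in fields[1:9]]
    total = sum(ticks)
    user, nice = ticks[0], ticks[1]
    return 100.0 * user / total, 100.0 * nice / total

if __name__ == "__main__":
    # Hypothetical sample: 100 ticks, all spent running a niced process in user mode.
    sample = "cpu0 0 100 0 0 0 0 0 0 0 0"
    print("%user={:.2f} %nice={:.2f}".format(*user_and_nice_share(sample)))
```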

Note that what we have discussed in this column is not defined by standards such as POSIX, so it may change as the kernel version changes. For example, the default value of kernel.sched_latency_ns has been changed many times in the past. Please be aware that even if you tune your system relying on the behavior mentioned here, the tuning may not remain effective in the future.


How Linux Works: Chapter2 Process Management (Part3)

Satoru Takeuchi — Sat, 10 Jun 2023 04:49:54 +0000

Job control in Shell

In this section, we will explain the concepts of sessions and process groups, which exist to implement the shell's job control.

For those unfamiliar with jobs, a job is a mechanism that shells like bash use to control processes running in the background. For example, you might use it as follows:

$ sleep infinity &
[1] 6176 # [1] is the job number
$ sleep infinity &
[2] 6200 # [2] is the job number
$ jobs # List the jobs
[1]-  Running                 sleep infinity &
[2]+  Running                 sleep infinity &
$ fg 1 # Make job 1 the foreground job
sleep infinity
^Z # Press Ctrl+z and control returns to bash
[1]+  Stopped                 sleep infinity

Sessions

A session corresponds to a login session created when a user logs into the system through a terminal emulator like gterm or through ssh. All sessions have a terminal attached for controlling the session. When you want to operate processes within the session, you send instructions through the terminal to the processes, including the shell, and receive their output through it. Normally, a virtual terminal named pts/<n> is assigned to each session.

Let's consider a situation where three sessions exist:

Alice's session: The login shell is bash. Developing a Go program with vim on it, and currently building some program with go build.
Bob's session 1: The login shell is zsh. Using ps aux on it to list all processes in the system, and receiving the results with less.
Bob's session 2: The login shell is zsh. Running a custom calculation program called calc on it.

This situation can be illustrated as follows:

Each session is assigned a unique value, called a "session ID" (or "SID"). Each session has a process called a session leader, which is typically a shell like bash. The PID of the session leader equals the ID of the session. Information about the session can be obtained by ps ajx. In the author's environment, it is as follows:

$ ps ajx
   PPID     PID    PGID     SID TTY        TPGID STAT   UID   TIME COMMAND
...
  19261   19262   19262   19262 pts/0      19647 Ss    1000   0:00 -bash
...
  19262   19647   19647   19262 pts/0      19647 R+    1000   0:00 ps ajx
...

In this case, we can see that there is a session (SID=19262) with bash(19262) as the session leader, and ps ajx (PID=19647) belongs to this session. Commands launched from bash(19262) normally belong to this session. The TTY field in ps ajx, and in ps aux used in previous sections, is the name of the terminal. In this session, a virtual terminal pts/0 is assigned.
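The rule that a session leader's PID equals the session ID can also be checked from a program. The sketch below assumes Python's subprocess module with start_new_session=True, which makes the child call setsid() before it starts; the function name is hypothetical.

```python
import subprocess
import sys

# Child program: print its own PID, SID, and PGID.
CHILD_CODE = "import os; print(os.getpid(), os.getsid(0), os.getpgid(0))"

def ids_of_new_session_leader():
    """Start a child in a new session and return its (PID, SID, PGID)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        start_new_session=True,   # the child calls setsid() before running
        capture_output=True, text=True, check=True,
    )
    pid, sid, pgid = (int(v) for v in result.stdout.split())
    return pid, sid, pgid

if __name__ == "__main__":
    print("PID, SID, PGID of the new session leader:", ids_of_new_session_leader())
```

Because the child created its own session, all three numbers are the same, just like the -bash line in the ps ajx output.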

When the terminal associated with the session hangs up, a SIGHUP is sent to the session leader. This happens when the terminal emulator's window is closed. bash will terminate its managed jobs and then terminate itself in this case. For cases where it would be a problem if bash terminated while a long-running process was executing, the following measures can be used:

The nohup command: Launches a process with the setting to ignore SIGHUP. Even if the session terminates and a SIGHUP is sent afterward, the process does not terminate.
The disown built-in command of bash: Removes a running job from bash's management. As a result, even if bash terminates, a SIGHUP will not be sent to the job.
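To see what ignoring SIGHUP means in practice, here is a small sketch in which a child process sets SIGHUP to be ignored, in the same spirit as a process launched through nohup, and the parent then sends it a SIGHUP. The program text and function name are illustrative assumptions.

```python
import signal
import subprocess
import sys

# Child program: ignore SIGHUP, report readiness, then answer each input line.
CHILD_CODE = """
import signal, sys
signal.signal(signal.SIGHUP, signal.SIG_IGN)
print("ready", flush=True)
for _ in sys.stdin:
    print("still alive", flush=True)
"""

def survives_sighup():
    """Send SIGHUP to a child that ignores it and check that it still responds."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    child.stdout.readline()              # wait until SIGHUP is being ignored
    child.send_signal(signal.SIGHUP)     # the signal a terminal hang-up would cause
    child.stdin.write("ping\n")
    child.stdin.flush()
    reply = child.stdout.readline().strip()
    child.stdin.close()                  # end of input lets the child exit normally
    return reply == "still alive", child.wait()

if __name__ == "__main__":
    print(survives_sighup())
```

Without the signal.signal() line in the child, the default action of SIGHUP would terminate it before it could answer.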

Process Groups

Process groups are used to control multiple processes collectively. Within a session, there exist several process groups. Essentially, you can think of the jobs created by the shell as corresponding to process groups¹.

¹ To be more precise, the shell itself also has its own process group, but for the sake of simplicity, we will omit this from the current discussion.

Let's illustrate process groups with an example. Suppose a session is set up as follows:

The login shell is bash.
From the above bash, go build <source name> & is executed.
From the above bash, ps aux | less is executed.

In this case, bash creates two process groups (jobs) corresponding to go build <source name> & and ps aux | less.

With process groups, a signal can be sent to all processes belonging to a certain process group. The shell uses this feature for job control. If you specify a negative value for the process ID argument of the kill command, the signal is sent to the process group whose PGID is the absolute value of that number. For example, to send a signal to a process group with a PGID of 100, you can run kill -- -100 (the -- prevents -100 from being interpreted as a signal number).
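The same operation is available from Python as os.killpg(). The following sketch starts a shell with two background sleep commands in a new process group and then signals the whole group at once; the function name and the shell command line are hypothetical examples.

```python
import os
import signal
import subprocess

def kill_whole_group():
    """Start a small process group, signal the whole group, and return the leader's exit status."""
    # start_new_session=True makes the shell the leader of a new process group
    # (and of a new session), so its PID is also the PGID of the group.
    leader = subprocess.Popen(
        ["sh", "-c", "sleep 60 & sleep 60 & wait"],
        start_new_session=True,
    )
    pgid = os.getpgid(leader.pid)
    os.killpg(pgid, signal.SIGTERM)   # same effect as os.kill(-pgid, signal.SIGTERM)
    return leader.wait()

if __name__ == "__main__":
    # A negative status means the shell was terminated by that signal number.
    print("exit status of the group leader:", kill_whole_group())
```

The two sleep processes belong to the same group as the shell, so a single call reaches all three processes.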

Process groups within a session can be divided into two types:

Foreground process group: Corresponds to the foreground job in the shell. Only one exists in a session and has direct access to the session's terminal.
Background process group: Corresponds to a background job in the shell. When a background process tries to operate the terminal, its execution is suspended, just as when it receives a SIGSTOP, and it stays suspended until its group becomes the foreground process group (the foreground job), for example through the fg built-in command.

Of the two, only the foreground process group (the foreground job) can access the terminal directly. This is illustrated as follows.

Each process group is assigned a unique ID known as a PGID. This value can be confirmed by the PGID field in ps ajx. In my environment, it looks like this:

$ ps ajx | less
   PPID     PID    PGID     SID TTY        TPGID STAT   UID   TIME COMMAND
...
  19261   19262   19262   19262 pts/0      19653 Ss    1000   0:00 -bash
...
  19262   19653   19653   19262 pts/0      19653 R+    1000   0:00 ps ajx
  19262   19654   19653   19262 pts/0      19653 S+    1000   0:00 less
...

From the output, we can see that there is a login session led by bash(19262), within which there is a process group with a PGID of 19653. The members of this group are ps ajx(19653) and less(19654), which are connected by a pipe.

Lastly, let me also mention how to distinguish foreground process groups. In the output of ps ajx, the processes belonging to the foreground process group have a + in their STAT field.

The concepts of sessions and process groups can be challenging.

Daemons

You might have heard the term "daemon" in the context of UNIX or Linux countless times. This section will discuss what daemons are and how they differ from regular processes. Simply put, daemons are resident processes. While regular processes are expected to terminate after completing a series of operations initiated by a user, daemons do not necessarily behave this way and may persist from system startup to shutdown, depending on the case.

Daemons have the following characteristics:

They do not require terminal I/O, so no terminal is assigned to them.
They possess their own session to remain unaffected even if all login sessions end.
Init acts as their parent process so that the process generating the daemon does not need to worry about the daemon's termination.

If illustrated, it would look like this:

However, even if they do not meet the above conditions, they may be referred to as daemons if they are resident processes.

You can determine whether a process is a daemon by looking at the results of ps ajx. Let's take a look at sshd, which operates as an ssh server.

$ ps ajx
   PPID     PID    PGID     SID TTY        TPGID STAT   UID   TIME COMMAND
...
      1     960     960     960 ?             -1 Ss       0   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
...

Indeed, no terminal is assigned (TTY is ?), the session ID equals the PID, and the parent process is init (PPID is 1).
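The check we just did by eye can be written down as a small function. This is only a heuristic sketch based on the three characteristics listed earlier, and looks_like_daemon is a hypothetical name; as noted above, some resident processes are called daemons even if they do not meet all three conditions.

```python
def looks_like_daemon(ps_ajx_line):
    """Apply the three characteristics above to one line of `ps ajx` output."""
    fields = ps_ajx_line.split()
    # Column order: PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
    ppid, pid, sid, tty = int(fields[0]), int(fields[1]), int(fields[3]), fields[4]
    return ppid == 1 and sid == pid and tty == "?"

if __name__ == "__main__":
    sshd = "1 960 960 960 ? -1 Ss 0 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"
    bash = "19261 19262 19262 19262 pts/0 19647 Ss 1000 0:00 -bash"
    print("sshd:", looks_like_daemon(sshd))
    print("bash:", looks_like_daemon(bash))
```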

Because daemons do not possess a terminal, the SIGHUP, which signifies a terminal hang-up, can be utilized for different purposes. By convention, it is often used as a signal for daemons to reread their configuration files.


How Linux Works: Chapter2 Process Management (Part2)

Satoru Takeuchi — Sat, 10 Jun 2023 04:41:38 +0000

The Relationship between Parent Process and Child Process

In the previous section, we discussed how a parent process generates child processes to create new processes. So where does one end up when following the parent process of the parent process...? This section will clarify this.

When you boot your computer, the system is initialized in the following order:

1. Turn on the power switch
2. Firmware, such as BIOS or UEFI, boots up and initializes the hardware
3. The firmware launches a bootloader like GRUB
4. The bootloader launches the OS kernel. Here we'll consider the Linux kernel
5. The Linux kernel starts the init process
6. The init process starts child processes, which in turn start their child processes, and so on, forming a tree structure of processes

Let's check if this is actually happening.

The pstree command displays the parent-child relationship of processes in a tree structure. pstree displays only command names by default, but it is useful to also display the PID by adding the -p option. In my environment, it looks like this:

$ pstree -p
systemd(1)-+-ModemManager(688)-+-{ModemManager}(723)
           |                   `-{ModemManager}(728)
...
           ├─sshd(960)───sshd(19191)───sshd(19261)───bash(19262)───pstree(19638)
...
$

You can see that the ancestor of all processes is the init process with pid=1 (displayed as systemd in the pstree output). You can also see, for example, that pstree(19638) was run from bash(19262).

States of Processes

In this section, we will discuss the concept of process states.

As already mentioned, there are always a large number of processes in a Linux system. Do these processes always use the CPU continuously? The answer is no.

The start time of a process running on the system, as well as the total amount of CPU time used, can be checked with the START field and TIME field of ps aux.

$ ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
...
sat        19262  0.0  0.0  12888  6144 pts/0    Ss   18:24   0:00 -bash
...

From this output, it can be seen that bash(19262) started at 18:24 and has used almost no CPU time since then. The time at which I am writing this manuscript is around 20:00, so even though it has been over an hour since it was started, you can see that this process has used less than a second of CPU time. The same can be said for many other processes, which I will omit here.

So, what were these processes mainly doing after they started? They were sleeping, waiting for some event to occur without using the CPU. In the case of bash(19262), it was waiting for user input, as there was nothing to do until the user input something. This can be seen from the STAT field of the ps output. A process with an S in the first character of the STAT field is in a sleep state.

On the other hand, a process that wants to use the CPU is said to be in a runnable state. At this time, the first character of STAT becomes R. When a process is actually using the CPU, it is said to be in a running state. How a process transitions between a running state and a runnable state will be discussed in the "Time Slices" and "Context Switches" sections of Chapter 3.

When a process terminates, it enters the zombie state (the STAT field is Z), and then it disappears. The meaning of the zombie state will be explained later.
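
The first character of the STAT field can also be read directly from the /proc file system. The following is a minimal sketch, assuming a Linux system with /proc mounted and a sleep command available; the helper name process_state() is hypothetical and not part of any library.

```python
#!/usr/bin/python3

import subprocess
import time


def process_state(pid):
    """Return the one-letter state (R, S, D, Z, ...) from /proc/<pid>/stat."""
    with open("/proc/{}/stat".format(pid)) as f:
        stat = f.read()
    # The format is "pid (command) state ...". The command may contain spaces
    # or parentheses, so split after the last ")".
    return stat.rsplit(")", 1)[1].split()[0]


# Start a child that sleeps, give it a moment to fall asleep, then check it.
child = subprocess.Popen(["sleep", "1"])
time.sleep(0.2)
state = process_state(child.pid)
print("state of sleep({}): {}".format(child.pid, state))
child.wait()
```

On a typical system the child is reported as S, because it spends almost all of its lifetime waiting for its timer to expire.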

The states of a process are summarized in the following figure.

As can be seen from this figure, a process transitions through various states during its lifetime.

If all processes on the system are in a sleep state, what is happening on the logical CPU? In fact, at this time, a special process called an idle process that "does nothing" is operating on the logical CPU. The idle process is not visible from ps.

The simplest implementation of an idle process is to spin in a meaningless loop until a new process is created or a sleeping process wakes up. However, this wastes power, so it's not usually done. Instead, the idle process uses a special CPU instruction to put the logical CPU into a sleep state, waiting in a state that reduces power consumption until one or more processes become runnable.

One of the main reasons why the battery lasts longer on your notebook PC or smartphone when you are not running any programs is because the logical CPU spends a lot of time in an idle state, which reduces power consumption.

Process Termination

In this section, we'll discuss how a process terminates. A process is terminated by invoking a system call called exit_group(). When a program calls the exit() function, as fork.py and fork-and-exec.py do, this system call is invoked internally. Even if the program itself doesn't call exit(), libc or other libraries will do so internally. Within exit_group(), the kernel reclaims resources such as memory used by the process (see the following figure).

After a process has terminated, the parent process can obtain the following information through system calls such as wait() or waitpid():

The process's return value. This is equal to the remainder when the argument to the exit() function is divided by 256. To make it clearer, if you specify a number between 0 and 255 as the argument to exit(), the return value will be the same as the argument.
Whether the process was terminated by a signal (discussed later)
How much CPU time the process used before termination

Through this mechanism, for example, you can handle anomalies such as outputting error logs if a process has terminated abnormally based on its return value.
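
The "remainder when divided by 256" rule can be checked with a short experiment. This is only a sketch: the function name run_child_and_get_status() is hypothetical, and it uses Python's os.fork(), os._exit(), and os.waitpid() wrappers rather than the C functions themselves.

```python
#!/usr/bin/python3

import os


def run_child_and_get_status(exit_arg):
    """Fork a child that exits with exit_arg; return the value the parent sees."""
    pid = os.fork()
    if pid == 0:
        # Child process: terminate immediately with the given argument.
        os._exit(exit_arg)
    # Parent process: wait for this child and decode its termination status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return None  # the child was terminated by a signal


for arg in (0, 1, 255, 256, 257):
    print("exit({}) -> return value {}".format(arg, run_child_and_get_status(arg)))
```

Arguments from 0 to 255 come back unchanged, while 256 and 257 come back as 0 and 1.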

In bash, you can obtain the termination status of a process that has been run in the background using the wait built-in command, which internally calls the wait() system call. Below, we run the wait-ret.sh program, which retrieves and outputs the return value of the false command, which always returns 1.

#!/bin/bash

false &
wait $! # wait for the termination of "false" process. We can get the PID of this program through `$!` variable.
echo "The false command has terminated: $?" # We can get the exit state of the process from `$?` variable.

$ ./wait-ret.sh
The false command has terminated: 1

Zombie Processes and Orphan Processes

The fact that a parent process can obtain the state of a child process through wait() system calls implies that, conversely, a child process exists in some form on the system from the time it terminates until its parent process invokes these system calls. A process that has terminated but whose parent has not obtained its termination status is called a zombie process. The name probably comes from the state of being 'dead but not dead', which is indeed a quite vivid term.

Generally, a parent process needs to appropriately reclaim the termination status of its child processes to prevent the system from overflowing with zombie processes and squandering resources. If there are a large number of zombie processes on the system during system operation, it may be worthwhile to suspect a bug in the program corresponding to the parent process.

If the parent of a process terminates before calling wait(), the process becomes an orphan process. The kernel makes init the new parent of the orphan process. If the parent of a zombie process terminates, the zombie process attacks init. This isn't a pleasant situation for init. However, init is smart and regularly issues wait() to reclaim system resources. It's quite a well-implemented system.
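
A zombie process is easy to create on purpose. The sketch below assumes Linux with /proc mounted; process_state() is a hypothetical helper that reads the state letter from /proc/<pid>/stat. The parent deliberately delays its wait so that the terminated child can be observed in the Z state.

```python
#!/usr/bin/python3

import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat."""
    with open("/proc/{}/stat".format(pid)) as f:
        return f.read().rsplit(")", 1)[1].split()[0]


pid = os.fork()
if pid == 0:
    os._exit(0)  # the child terminates immediately

# Parent: do not wait yet. Poll until the child shows up as a zombie.
state_before_wait = None
for _ in range(100):
    state_before_wait = process_state(pid)
    if state_before_wait == "Z":
        break
    time.sleep(0.01)
print("state before waitpid(): {}".format(state_before_wait))

# Collecting the termination status makes the zombie disappear.
os.waitpid(pid, 0)
entry_exists_after_wait = os.path.exists("/proc/{}".format(pid))
print("still listed in /proc after waitpid(): {}".format(entry_exists_after_wait))
```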

Signals

Processes generally run continuously according to a single stream of execution. Although there are conditional branch instructions, these merely shift the flow according to predefined conditional statements. In contrast, a signal is a mechanism for a process to notify another process and forcibly change the flow of execution from the outside.

There are several types of signals, but the most commonly used is undoubtedly SIGINT. This signal is sent when you press Ctrl+c in a shell like bash. By default, a process that receives SIGINT terminates immediately. Regardless of how the program is structured, the ability to terminate a process the instant a signal is issued is convenient, and many Linux users use this signal, whether they are aware of its effect or not.

Signals can also be sent from outside bash using the kill command. For example, if you want to send SIGINT, execute kill -INT <pid>. In addition to SIGINT, there are signals like the following:

SIGCHLD: Sent to the parent process when a child process terminates. It's common to call wait() within this signal handler.
SIGSTOP: Temporarily suspends the execution of a process. Pressing Ctrl+z in bash suspends the running program in the same way. Strictly speaking, in that case the terminal sends the closely related SIGTSTP signal to the program.
SIGCONT: Resumes the execution of a process that was stopped by SIGSTOP or similar.

You can see a list of signals by running the man 7 signal command.

Although I wrote earlier that 'a process that receives SIGINT will terminate by default', this does not mean that a process always terminates when it receives SIGINT. A process can register a signal handler for each signal in advance. If the process receives the corresponding signal during execution, it temporarily interrupts the current operation, runs the signal handler, and then returns to the original location and resumes operation. Alternatively, the process can be set to ignore the signal.
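
The handler mechanism can be sketched in a few lines. In this example the process sends SIGINT to itself with os.kill(), which stands in for pressing Ctrl+c; the names received and on_sigint() are hypothetical.

```python
#!/usr/bin/python3

import os
import signal
import time

received = []


def on_sigint(signum, frame):
    # Runs when SIGINT arrives. After it returns, the interrupted code resumes.
    received.append(signal.Signals(signum).name)


previous = signal.signal(signal.SIGINT, on_sigint)

# Send SIGINT to ourselves, as "kill -INT <pid>" would do from a shell.
os.kill(os.getpid(), signal.SIGINT)

# Give the interpreter a moment to run the handler.
deadline = time.monotonic() + 1.0
while not received and time.monotonic() < deadline:
    time.sleep(0.01)

print("signals handled: {}".format(received))
print("the process is still running")

# Put the previous behavior back if there was one.
if previous is not None:
    signal.signal(signal.SIGINT, previous)
```

The process does not terminate: the handler records the signal and the main flow simply continues.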

By using a signal handler, you can create an annoying program (intignore.py) that does not terminate even when Ctrl+c is pressed. For example, you can create it in Python like this.

#!/usr/bin/python3

import signal

# Set to ignore SIGINT
# - 1st arg: The signal to ignore
# - 2nd arg: signal handler
signal.signal(signal.SIGINT, signal.SIG_IGN)

while True:
    pass

When you run this program, it will look like this:

$ ./intignore.py
^C^C^C

^C indicates that you've typed Ctrl+c. It's really annoying, isn't it?

If you've tried this program yourself, suspend intignore.py with Ctrl+z and then terminate it with the kill command. By default, kill sends SIGTERM, which the program does not ignore, so it can terminate.

Column: The Absolutely Lethal SIGKILL Signal and the Absolutely Indestructible Process

SIGKILL could be considered the last resort to use when a process does not die gracefully by other signals like SIGINT. This is a special one among all signals, as a process that receives this signal is always terminated. It is not possible to change the behavior through a signal handler. From the signal name KILL, you can feel the strong intention to definitely kill the process.

Having said all this, there are occasionally nefarious processes that won't die, even with SIGKILL. For some reason, these processes are in a special state called uninterruptible sleep, in which they do not accept signals for a long time. The first character in the STAT field of ps aux for these processes is D. This often occurs when disk I/O takes a long time. It may also be due to some problem with the kernel. In any case, there is often nothing that can be done from the user's side.
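
That the behavior of SIGKILL cannot be changed is easy to confirm: the kernel rejects the attempt to register anything for it. The following sketch uses a hypothetical helper, can_change_behavior(), and compares SIGKILL with SIGUSR1, an ordinary signal chosen here only as an example.

```python
#!/usr/bin/python3

import signal


def can_change_behavior(signum):
    """Return True if this process is allowed to ignore the given signal."""
    try:
        previous = signal.signal(signum, signal.SIG_IGN)
    except OSError:
        # The kernel refuses to change the behavior of this signal.
        return False
    if previous is not None:
        signal.signal(signum, previous)  # put the original behavior back
    return True


for sig in (signal.SIGUSR1, signal.SIGKILL):
    print("{}: {}".format(sig.name, can_change_behavior(sig)))
```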


How Linux Works: Chapter 3 Process Scheduler (Part 1)

Satoru Takeuchi — Fri, 02 Jun 2023 01:48:18 +0000

In Chapter 2, we mentioned that most processes in the system are in the sleep state. So, how does the kernel make each process run on the CPU when there are multiple executable processes in the system? In this chapter, we will discuss the Linux kernel's process scheduler (hereinafter referred to as the scheduler), which is responsible for allocating CPU resources to processes.

In books about computer science, the scheduler is explained as follows:

Only one process can run at a time on a single logical CPU
It allows multiple executable processes to use the CPU in turn, in units called "time slices"

For example, if there are three processes, p0, p1, and p2, it would look like the following figure.

Let's confirm whether this is actually the case in Linux by conducting an experiment.

Elapsed Time and CPU Time

To understand the content of this chapter, it is essential to understand the concepts of elapsed time and CPU time related to processes. This section explains these times. Their definitions are as follows:

Elapsed Time: The time that passes from the start to the end of a process. It is like the time measured with a stopwatch from the start to the end of a process.
CPU Time: The time that a process actually uses a logical CPU.

These terms probably won't make much sense just by looking at the explanation, so let's understand them through an experiment. If you run a process using the time command, you can get the elapsed time and CPU time from the start to the end of the target process. For example, let's run the following "load.py" which terminates after consuming some CPU resources.



#!/usr/bin/python3

# This parameter is to change the amount of CPU time used by this program. It is convenient to change this parameter to make the CPU time consumption several seconds.
NLOOP=100000000

for _ in range(NLOOP):
    pass

The result is as follows.



$ time ./load

real    0m2.357s
user    0m2.357s
sys     0m0.000s

The output includes three lines starting with real, user, and sys. Among these, real indicates the elapsed time, and user and sys indicate the CPU time. user refers to the time when the process was running. In contrast, sys refers to the time when the kernel was operating as a result of system calls issued by the process. The load program continues to use the CPU from the start to the end of its execution and does not issue any system calls during that time, so real and user are almost the same, and sys is almost zero. It is only "almost" because the Python interpreter issues a few system calls at the beginning and end of the process.

Let's also run an experiment with the sleep command, which sleeps most of the time.



$ time sleep 3

real    0m3.009s
user    0m0.002s
sys     0m0.000s

Since it terminates after sleeping for 3 seconds, "real" is nearly 3 seconds. On the other hand, this command uses only a small amount of CPU time right after starting and then goes to sleep. Three seconds later, it wakes up and again uses only a little CPU time before terminating. So "user" and "sys" are almost 0. The difference between the two kinds of time is shown in the following figure.
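
The same contrast can be reproduced inside a single Python program. This is a rough sketch rather than a replacement for the time command: it assumes that time.perf_counter() is a good stand-in for elapsed time and time.process_time() for the user + sys time of the current process. The function names are hypothetical.

```python
#!/usr/bin/python3

import time


def measure(func):
    """Return (elapsed seconds, CPU seconds) used by this process while running func."""
    start_real = time.perf_counter()
    start_cpu = time.process_time()
    func()
    return time.perf_counter() - start_real, time.process_time() - start_cpu


def busy():
    # Consumes CPU the whole time, like load.py.
    for _ in range(2000000):
        pass


def idle():
    # Sleeps the whole time, like the sleep command.
    time.sleep(0.3)


for name, func in (("busy loop", busy), ("sleep", idle)):
    real, cpu = measure(func)
    print("{}: elapsed={:.3f}s cpu={:.3f}s".format(name, real, cpu))
```

For the busy loop the two values are close to each other, while for the sleep the CPU time stays near zero.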

When Using Only One Logical CPU

To simplify the discussion, let's first consider the case of a single logical CPU. The "multiload.sh" program will be used for the experiment. This program performs the following actions:



Usage: ./multiload [-m] <number of processes>
  Executes a specified number of load processing processes for a set period of time and waits for all to finish.
  The time it took to execute each process is outputted.
  By default, all processes run only on a single logical CPU.

  Meaning of the option:
  -m: Allows each process to run on multiple CPUs.

Here is its source code.



#!/bin/bash

MULTICPU=0
PROGNAME=$0
SCRIPT_DIR=$(cd $(dirname $0) && pwd)

usage() {
    exec >&2
    echo "Usage: $PROGNAME [-m] <number of processes>
    Executes a specified number of load processing processes for a set period of time and waits for all to finish.
    The time it took to execute each process is outputted.
    By default, all processes run only on a single logical CPU.

      Meaning of the option:
      -m: Allows each process to run on multiple CPUs."
    exit 1
}

while getopts "m" OPT ; do
    case $OPT in
        m)
            MULTICPU=1
            ;;
        \?)
            usage
            ;;
    esac
done

shift $((OPTIND - 1))

if [ $# -lt 1 ] ; then
    usage
fi

CONCURRENCY=$1

if [ $MULTICPU -eq 0 ] ; then
    # Pin this script to CPU0. The "load.py" processes started below inherit this setting
    taskset -p -c 0 $$ >/dev/null
fi

for ((i=0;i<CONCURRENCY;i++)) do
    time "${SCRIPT_DIR}/load.py" &
done

for ((i=0;i<CONCURRENCY;i++)) do
    wait
done
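
The pinning step in the script above relies on taskset. For reference, roughly the same effect can be obtained from Python with os.sched_setaffinity(), which is available on Linux. The sketch below pins the current process to one of the logical CPUs it is already allowed to use and then restores the original setting; the variable names are hypothetical.

```python
#!/usr/bin/python3

import os

# 0 means "the calling process". The result is the set of usable logical CPUs.
allowed = os.sched_getaffinity(0)
print("allowed logical CPUs: {}".format(sorted(allowed)))

# Pin this process (and any children it starts later) to a single logical CPU.
target = min(allowed)
os.sched_setaffinity(0, {target})
pinned = os.sched_getaffinity(0)
print("after pinning: {}".format(sorted(pinned)))

# Restore the original setting.
os.sched_setaffinity(0, allowed)
restored = os.sched_getaffinity(0)
```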

Let's first set <number of processes> to 1 and run it. This is almost the same as running the load program alone.



$ ./multiload 1

real    0m2.359s
user    0m2.358s
sys     0m0.000s

In my environment, the elapsed time was 2.359 seconds. What about when the parallelism is 2 and 3?



$ ./multiload 2

real    0m4.730s
user    0m2.360s
sys     0m0.004s

real    0m4.739s
user    0m2.374s
sys     0m0.000s
$ ./multiload 3

real    0m7.095s
user    0m2.360s
sys     0m0.004s

real    0m7.374s
user    0m2.499s
sys     0m0.000s

real    0m7.541s
user    0m2.676s
sys     0m0.000s

As the parallelism increased to 2 and 3, the CPU time of each process did not change much, but the elapsed time increased by about 2 and 3 times. This is because, as mentioned at the beginning, only one process can run at a time on one logical CPU, and the scheduler gives CPU resources to each process in turn.
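
Why the elapsed time grows in proportion to the number of processes can be seen with a toy model. The following sketch simulates an idealized round-robin scheduler on one logical CPU. It is an assumption-laden illustration, not how the Linux scheduler is implemented, and the function name and its parameters are hypothetical.

```python
#!/usr/bin/python3

from collections import deque


def round_robin_finish_times(nproc, work, timeslice):
    """Simulate nproc processes that each need `work` units of CPU time.

    Returns the time at which each process finishes on a single logical CPU.
    """
    remaining = [work] * nproc
    finish = [None] * nproc
    queue = deque(range(nproc))
    now = 0
    while queue:
        p = queue.popleft()
        run = min(timeslice, remaining[p])
        now += run                  # only this process runs during its time slice
        remaining[p] -= run
        if remaining[p] == 0:
            finish[p] = now
        else:
            queue.append(p)         # go to the back of the queue
    return finish


for n in (1, 2, 3):
    print(n, round_robin_finish_times(n, work=6, timeslice=1))
```

Each process still needs 6 units of CPU time, but with 2 and 3 processes the last one finishes at 12 and 18, matching the roughly 2x and 3x elapsed times measured above.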

When Using Multiple Logical CPUs

Next, let's also take a look at the case of multiple logical CPUs. When the multiload program is run with the "-m" option, the scheduler tries to distribute the multiple load processes evenly across all logical CPUs. As a result, for instance, if there are two logical CPUs and two load processes, as shown in the following figure, the two load processes can each monopolize the resources of a logical CPU.

The logic of load balancing is very complex, so this book avoids detailed explanation.

Let's actually verify this. The results of running the multiload program with the -m option and parallelism from 1 to 3 are shown below.



$ ./multiload -m 1

real    0m2.361s
user    0m2.361s
sys     0m0.000s
$ ./multiload -m 2

real    0m2.482s
user    0m2.482s
sys     0m0.000s

real    0m2.870s
user    0m2.870s
sys     0m0.000s
$ ./multiload -m 3

real    0m2.694s
user    0m2.693s
sys     0m0.000s

real    0m2.857s
user    0m2.853s
sys     0m0.004s

real    0m2.936s
user    0m2.935s
sys     0m0.000s

For all processes, the values of real and user+sys were almost the same. In other words, we can see that each process was able to monopolize the resources of a logical CPU.

Cases where "user + sys" is larger than "real"

Intuitively, you might think that "real >= user + sys" will always hold, but in reality, there can be cases where the value of "user + sys" is slightly larger than that of "real". This is due to the different methods used to measure each time and the fact that the precision of the measurement is not that high. There's no need to worry too much about this; it's enough to be aware that such things can happen.

Furthermore, there are cases where "user + sys" can be significantly larger than "real". For example, this occurs when you run the "multiload.sh" program with the "-m" option and set the number of processes to 2 or more. Now, let's try running "./multiload -m 2" through the time command.



$ time ./multiload -m 2

real    0m2.510s
user    0m2.502s
sys     0m0.008s

real    0m2.725s
user    0m2.716s
sys     0m0.008s

real    0m2.728s
user    0m5.222s
sys     0m0.016s

The first and second entries are data about the load processes of the "multiload.sh" program. The third entry is data about the "multiload.sh" program itself. As you can see, the value of "user" is about twice that of "real". In fact, the "user" and "sys" values obtained by the "time" command are the sums of the values for the target process and its terminated child processes. Therefore, if a process generates child processes and they each run on a different logical CPU, the value of "user + sys" can be greater than "real". The "multiload.sh" program fits exactly this condition.
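
The accounting of terminated children can be observed from Python through os.times(), whose children_user and children_system fields grow once a child has been waited for. This is a sketch under the assumption that a few million loop iterations are enough to consume a measurable amount of CPU time; BURN and run_children() are hypothetical names.

```python
#!/usr/bin/python3

import os
import subprocess
import sys
import time

# A small CPU-bound program, similar in spirit to load.py.
BURN = "for _ in range(3000000): pass"


def run_children(n):
    """Run n CPU-bound children in parallel; return (elapsed, CPU time of the children)."""
    before = os.times()
    start = time.perf_counter()
    children = [subprocess.Popen([sys.executable, "-c", BURN]) for _ in range(n)]
    for child in children:
        child.wait()  # the children's CPU time is added to ours once they are waited for
    elapsed = time.perf_counter() - start
    after = os.times()
    child_cpu = ((after.children_user - before.children_user)
                 + (after.children_system - before.children_system))
    return elapsed, child_cpu


elapsed, child_cpu = run_children(2)
print("elapsed={:.3f}s children user+sys={:.3f}s".format(elapsed, child_cpu))
```

On a machine with two or more free logical CPUs, the children's CPU time can exceed the elapsed time, just as in the time output above. On a single logical CPU it cannot.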


How Linux Works: Chapter2 Process Management (Part1)

Satoru Takeuchi — Fri, 31 Mar 2023 14:16:46 +0000

It is common for a system to have multiple processes. For example, you can list all processes in the system by running the ps aux command.



$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND ...(1)
...
sat        19261  0.0  0.0  13840  5360 ?        S    18:24   0:00 sshd: sat@pts/0
sat        19262  0.0  0.0  12120  5232 pts/0    Ss   18:24   0:00 -bash
...
sat        19280  0.0  0.0  12752  3692 pts/0    R+   18:25   0:00 ps aux
$

The line (1) is the header line that indicates the meaning of the output in the following lines, and after that, one process is displayed per line. Among them, the COMMAND field represents the command name. We won't go into detail here, but you can see that the sshd ssh server (PID=19261) started bash (PID=19262), which then executed ps aux.

You can remove the header line of the ps command output by using the --no-header option. Now let's check the number of processes in my environment.



$ ps aux --no-header | wc -l
216
$

There were 216 processes. What are each of these processes doing? How are they managed? In this chapter, we will explain the process management system that Linux uses to manage these processes.

Process Creation

There are two main purposes for creating new processes:

a. Divide the processing of the same program into multiple processes (e.g., handling multiple requests by a web server).
b. Generate a different program (e.g., creating various programs from bash).

To achieve these, Linux uses the fork() function and the execve() function¹. Internally, they call the clone() and execve() system calls, respectively. For the case a, only the fork() function is used, while for the case b, both the fork() function and the execve() function are used.

The fork() function that splits the same process into two

When the fork() function is issued, a copy of the process that issued it is made, and both the original and the copy return from the fork() function. The original process is called the parent process, and the generated process is called the child process. The flow at this time is as follows:

The parent process calls the fork() function.
Allocate memory space for the child process and copy the parent process's memory into it.
Both the parent and child processes return from the fork() function. Since the return values of the fork() function differ for the parent and the child, the subsequent processing can branch accordingly.

However, in reality, the memory copy from the parent process to the child process is done at a very low cost thanks to a feature called Copy-on-Write, which will be explained in the following chapter. As a result, the overhead of dividing the processing of the same program into multiple processes in Linux is small.

Let's take a look at how the fork() function generates processes by creating the fork.py program with the following specifications:

Call the fork() function to branch the process flow.
The parent process outputs its own process ID and the child process's process ID and then exits. The child process outputs its own process ID and its parent's process ID and then exits.

#!/usr/bin/python3

import os, sys

ret = os.fork()
if ret == 0:
    # Child process: fork() returned 0
    print("child process: pid={}, parent process's pid={}".format(os.getpid(), os.getppid()))
    sys.exit()
elif ret > 0:
    # Parent process: fork() returned the child's process ID
    print("parent process: pid={}, child process's pid={}".format(os.getpid(), ret))
    sys.exit()

# Not reached in practice: os.fork() raises OSError when it fails
sys.exit(1)

In the fork.py program, when returning from the fork() function, the parent process gets the child process's process ID, while the child process gets 0. Since the pid is always 1 or more, this can be used to branch the processing after the fork() function call in the parent and child processes.

Now, let's try running it.



$ ./fork.py
parent process: pid=132767, child process's pid=132768
child process: pid=132768, parent process's pid=132767

You can see that a process with process ID 132767 has branched and created a new process with process ID 132768, and that after issuing the fork() function, the processing branches according to the return value of fork().
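
Because the child is a copy of the parent, a change made by the child to its own memory is not visible to the parent. The following sketch illustrates this; the pipe is used only so that the child can report its value back, and the variable names are hypothetical.

```python
#!/usr/bin/python3

import os

value = "set by parent"

read_end, write_end = os.pipe()
pid = os.fork()
if pid == 0:
    # Child process: modify its own copy of the variable and report it.
    os.close(read_end)
    value = "changed by child"
    os.write(write_end, value.encode())
    os._exit(0)

# Parent process: read what the child reported, then collect the child.
os.close(write_end)
child_value = os.read(read_end, 1024).decode()
os.close(read_end)
os.waitpid(pid, 0)

print("value seen by the child:  {}".format(child_value))
print("value seen by the parent: {}".format(value))
```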

The fork() function can be quite difficult to understand at first, but please try to master it by repeatedly reading the content and sample code in this section.

The execve() function for launching a different program

After creating a copy of the process with the fork() function, the execve() function is called on the child process. As a result, the child process is replaced with another program. The flow of processing is as follows:

Call the execve() function.
Read the program specified in the arguments of the execve() function, and read the information required to place the program in memory (called "memory mapping").
Overwrite the current process's memory with the new program's data.
Start executing the new program from its first instruction (the entry point).

In other words, while the fork() function increases the number of processes, the execve() function does not. When an entirely different program is launched, the program running in the process is replaced with another one.

To express this in a program, it would look like the following fork-and-exec.py program. Here, after the fork() function call, the child process is replaced with the echo "hello from pid=<pid>" command by the execve() function.



#!/usr/bin/python3

import os, sys

ret = os.fork()
if ret == 0:
    print("child process: pid={}, parent process's pid={}".format(os.getpid(), os.getppid()))
    os.execve("/bin/echo", ["echo", "hello from pid={} ".format(os.getpid())], {})
    exit()
elif ret > 0:
    print("parent process: pid={}, child process's pid={}".format(os.getpid(), ret))
    exit()

sys.exit(1)

The result of the execution looks like this:





$ ./fork-and-exec.py
parent process: pid=5843, child process's pid=5844
child process: pid=5844, parent process's pid=5843
hello from pid=5844


If we illustrate this result, it would look like the following figure. For simplicity, we omit the loading of the program by the kernel and copying the program to memory.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8v0an3dq0lvxi71rmats.jpg)

To implement the execve() function, the executable file holds the necessary data for starting the program in addition to the program code and data. This data includes:

- The file offset, size, and memory map starting address of the code area.
- The same information for the data area.
- The memory address of the first instruction to execute (entry point).

Now let's take a look at how Linux executable files hold this information. Linux executable files are usually in the Executable and Linking Format (ELF). Various information about ELF files can be obtained using the readelf command.

We will use the pause program again here, which we used in the previous chapter. Let's start with the build.



#include <unistd.h>

int main(void) {
    pause();
    return 0;
}



$ cc -o pause -no-pie pause.c # build pause program with -no-pie option. See the next column to know the meaning of this option

The program's starting address can be obtained using readelf -h.



$ readelf -h pause
...
  Entry point address:               0x400400
...

"0x400400" on the "Entry point address" line is the entry point of this program.

The file offset, size, and starting address of the code and data can be obtained using the readelf -S command.



$ readelf -S pause
There are 29 section headers, starting at offset 0x18e8:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
...
  [13] .text             PROGBITS         0000000000400400  00000400
       0000000000000172  0000000000000000  AX       0     0     16
...
  [23] .data             PROGBITS         0000000000601020  00001020
       0000000000000010  0000000000000000  WA       0     0     8
...

Although we obtained a large amount of output, understanding the following points is sufficient:

The executable file is divided into multiple regions, each of which is called a section.
Information about each section is displayed as a pair of two lines.
All values are in hexadecimal.
The main information about each section is as follows:
- Section name: the second field "Name" in the first line
- Memory map starting address: the fourth field "Address" in the first line
- File offset: the fifth field "Offset" in the first line
- Size: the first field "Size" in the second line
Sections with the name ".text" are code sections, and those with the name ".data" are data sections.

This information is summarized in the following table.

Name	Value
File offset of the code	0x400
Size of the code	0x172
Memory map starting address of the code	0x400400
File offset of the data	0x1020
Size of the data	0x10
Memory map starting address of the data	0x601020
Entry point	0x400400
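
The entry point in the table is stored at a fixed position in the ELF header, so it can also be read without readelf. The sketch below is a simplified reader, not a full ELF parser: it assumes the standard header layout, in which the entry point follows the 16-byte identification, the 2-byte type, the 2-byte machine, and the 4-byte version fields. The function name elf_entry_point() is hypothetical, and the Python interpreter itself is used as a convenient sample file.

```python
#!/usr/bin/python3

import os
import struct
import sys


def elf_entry_point(path):
    """Return the entry point address stored in the ELF header of `path`."""
    with open(path, "rb") as f:
        header = f.read(32)
    if header[:4] != b"\x7fELF":
        raise ValueError("{} is not an ELF file".format(path))
    is_64bit = header[4] == 2                     # 1 = 32-bit, 2 = 64-bit
    byte_order = "<" if header[5] == 1 else ">"   # 1 = little endian, 2 = big endian
    # The entry point starts at offset 24 and is 8 bytes (64-bit) or 4 bytes (32-bit).
    fmt = byte_order + ("Q" if is_64bit else "I")
    (entry,) = struct.unpack_from(fmt, header, 24)
    return entry


path = os.path.realpath(sys.executable)
print("{}: entry point {:#x}".format(path, elf_entry_point(path)))
```

Running elf_entry_point("pause") on the binary built above should report the same address as readelf -h.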

The memory map of the process created from the program can be obtained using the /proc/<pid>/maps file. Now let's take a look at the memory map of the pause command.



$ ./pause &
[3] 12492
$ cat /proc/12492/maps
00400000-00401000 r-xp 00000000 08:02 788371                             .../pause ... (1)
00600000-00601000 r--p 00000000 08:02 788371                             .../pause
00601000-00602000 rw-p 00001000 08:02 788371                             .../pause ... (2)
...

(1) is the code region, and (2) is the data region. We can see that the starting addresses shown in the table above fall within these memory map ranges.
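
This check can also be done mechanically. The following sketch parses the maps format and looks up which region contains a given address. The first part uses the three regions and two addresses quoted in this section. The second part reads the map of the running Python process itself. The function names are hypothetical.

```python
#!/usr/bin/python3


def read_maps(pid="self"):
    """Parse /proc/<pid>/maps into (start, end, permissions, path) tuples."""
    regions = []
    with open("/proc/{}/maps".format(pid)) as f:
        for line in f:
            fields = line.split(None, 5)
            start, end = (int(x, 16) for x in fields[0].split("-"))
            path = fields[5].strip() if len(fields) > 5 else ""
            regions.append((start, end, fields[1], path))
    return regions


def region_containing(regions, address):
    """Return the region that contains `address`, or None."""
    for region in regions:
        if region[0] <= address < region[1]:
            return region
    return None


# The regions of the pause process shown above.
article_regions = [
    (0x00400000, 0x00401000, "r-xp", ".../pause"),
    (0x00600000, 0x00601000, "r--p", ".../pause"),
    (0x00601000, 0x00602000, "rw-p", ".../pause"),
]
code_region = region_containing(article_regions, 0x400400)
data_region = region_containing(article_regions, 0x601020)
print("code starts in a region with permissions {}".format(code_region[2]))
print("data starts in a region with permissions {}".format(data_region[2]))

# The memory map of this Python process.
own_regions = read_maps()
print("this process has {} mapped regions".format(len(own_regions)))
```

The code address lands in the executable region (r-xp) and the data address in the writable one (rw-p), as expected.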

Once you have finished, terminate the pause process.



$ kill 12492

Security enhancement with Address Space Layout Randomization feature

In this section, we explain the meaning of the -no-pie option attached during the build of the pause program in the previous section. This option is related to a security feature called Address Space Layout Randomization (ASLR) that the Linux kernel has. ASLR is a feature that maps each section of a program to a different address every time it is executed. Thanks to this, attacks that assume the target code or data exists at a specific address become difficult.

The conditions for using this feature are as follows:

The ASLR function of the kernel is enabled. It is enabled by default in Ubuntu 20.04².
The program supports ASLR. Such programs are called Position Independent Executable (PIE).

gcc (in this book's example, the cc command) in Ubuntu 20.04 builds all programs as PIE by default, but PIE can be disabled with the -no-pie option. In the previous section, we disabled PIE intentionally. Otherwise, the values in /proc/<pid>/maps would differ from those written in the executable file and would change on every run, which would be confusing.

Whether a program is PIE or not can be checked with the file command. If it is PIE, the following output is obtained:



$ file pause
pause: ELF 64-bit LSB shared object, ...
$

If it is not PIE, the following output is obtained:



$ file pause
pause: ELF 64-bit LSB executable, ...
$

For reference, let's run the pause program built without the -no-pie option twice and check where the code section is mapped to memory each time.



$ cc -o pause pause.c
$ ./pause &
[5] 15406
$ cat /proc/15406/maps
559c5778f000-559c57790000 r-xp 00000000 08:02 788372                     .../pause
...
$ ./pause &
[6] 15536
$ cat /proc/15536/maps
5568d2506000-5568d2507000 r-xp 00000000 08:02 788372                     .../pause
...
$ kill 15406 15536

As you can see, it is mapped to different locations on the first and second runs.
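
The same experiment can be automated. The sketch below starts a new Python interpreter twice and compares the starting address of the first region in each process's /proc/self/maps. Whether the two addresses differ depends on assumptions this code cannot check: the interpreter must be built as PIE and ASLR must be enabled in the kernel. CHILD and first_mapping_start() are hypothetical names.

```python
#!/usr/bin/python3

import subprocess
import sys

# Program run in each child: print the start address of its first mapped region.
CHILD = "print(open('/proc/self/maps').readline().split('-')[0])"


def first_mapping_start():
    """Start a new Python process and return the start address of its first mapping."""
    result = subprocess.run([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, check=True)
    return int(result.stdout.decode().strip(), 16)


first = first_mapping_start()
second = first_mapping_start()
print("first run:  {:#x}".format(first))
print("second run: {:#x}".format(second))
print("different" if first != second else "same")
```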

Actually, programs distributed as part of Ubuntu 20.04 are PIE as much as possible. It's wonderful that security is enhanced automatically without users or programmers being particularly aware of it. However, there are also security attacks to bypass ASLR, so the history of security technology is a game of cat and mouse.

Column: Process Generation Methods Other Than fork() and execve()

Using fork() and execve() functions in sequence to create a program from within a process may seem redundant. In such cases, the posix_spawn() function defined in the C language interface specification for UNIX-based operating systems, POSIX, can simplify the process.

Below is a program named spawn.py that generates an echo command as a child process using the posix_spawn() function:



#!/usr/bin/python3

import os

os.posix_spawn("/bin/echo", ["echo", "created by posix_spawn()"], {})
print("created echo command")



$ ./spawn.py
created echo command
created by posix_spawn()

The same thing can be achieved with the fork() and execve() functions, as in the following spawn-by-fork-and-exec.py program:



#!/usr/bin/python3

import os

ret = os.fork()
if ret == 0:
    os.execve("/bin/echo", ["echo", "created by fork() and execve()"], {})
elif ret > 0:
    print("created echo")



$ ./spawn-by-fork-and-exec.py
created echo
created by fork() and execve()

As you can see, the spawn.py program is more readable.

Although process generation using posix_spawn() is intuitive, it can become unnecessarily complex compared to the fork() and execve() functions when more sophisticated operations are needed, as in a shell implementation. For reference, I use the posix_spawn() function only when the execve() function is called immediately after the fork() function without performing any other operations. Otherwise, I use the fork() and execve() functions.


NOTE

This article is based on my book written in Japanese. Please contact me via satoru.takeuchi@gmail.com if you're interested in publishing this book's English version.

If you run man 3 exec, you will find many variations of the execve() function. ↩
For reference, to disable ASLR in the kernel, set the kernel.randomize_va_space parameter of sysctl to 0. ↩

ATARI is still alive: Atari Partition of Fear

Satoru Takeuchi — Tue, 28 Mar 2023 00:26:32 +0000

Introduction

This article explains a data corruption issue that occurred in Rook in 2021. The root cause lies in an unexpected place, and the issue can occur in any Ceph environment. Interestingly, Rook started to encounter this problem only recently even though it has existed for a long time, due to a series of coincidences. I wrote this article because it struck me that the word "Atari" came up in a non-historical context in 2021.

This article is a restructured version of the information written in Rook's official documentation, with additional information for those who are not familiar with Rook.

Glossary

Ceph: An open source distributed storage system
Rook: An orchestrator for Ceph running on Kubernetes. This is also open source
OSD: A data structure on disk that usually corresponds to one disk in a Ceph cluster
OSD on disk: One of the methods to create OSDs on Rook. The device path is directly written in the Rook configuration. Details will be discussed later
Atari partition: The partition format used in the once-existing Atari ST computer

Problem Summary

The problem:
- OSD data gets corrupted
Root Cause Summary:
- The disk containing the OSD is mistakenly recognized as having an Atari partition, and Rook creates an OSD on that (non-existent) partition
Occurrence Conditions:
- Using Rook v1.6.0 to v1.6.7
- Creating an OSD on disk on an unpartitioned disk
Mitigation:
- Update to Rook v1.6.8 or higher, or Rook v1.7
Recovery from Data Corruption:
- Impossible. The only option is to recreate the OSD on the disk holding the corrupted OSD using this procedure

Mechanism

Let's assume that Rook tries to create an OSD on a disk called /dev/sdb.

1. Rook creates an OSD on /dev/sdb.
2. The Linux kernel mistakenly recognizes that /dev/sdb contains an Atari partition table rather than an OSD.
3. Rook finds some "phantom" empty Atari partitions that appear usable for creating OSDs.
4. Rook creates an OSD on one of the empty partitions found in step 3, which corrupts the data of the OSD on /dev/sdb.

Details

To understand the problem in detail, some knowledge about Rook, Ceph, and Atari Partition is required. I will first explain some prerequisite knowledge, and then describe the actual flow leading up to the problem.

Configuring OSD on device in Rook

When creating an OSD in Rook, you write settings such as "I want to create an OSD under these conditions" in a CephCluster Custom Resource (CR). To create an OSD on device, you specify one of the following:

The path of the device where you want to create the OSD ("/dev/sdb", etc.)
A regular expression to match the device ("/dev/sd.*", etc.)
The specification "create on all unused devices" (useAllDevices: true)

Please refer to the official documentation for details.

When Rook operates, it creates OSDs on each device according to the Ceph Cluster CR as follows:

Rook on the node composing the Rook cluster executes a command called ceph-volume provided by Ceph.
The ceph-volume command lists the devices present in the system and displays whether they are empty and can be used to create OSDs.
Rook creates OSDs on devices that are empty and match the settings in the Ceph Cluster CR.

OSD Formats in Ceph

When Ceph creates an OSD on a device, it writes OSD metadata to the device. There are two formats of OSD in Ceph, each with different locations for writing metadata.

lvm mode OSD: Create a Volume Group (VG) in LVM on the device, then create a Logical Volume (LV) within it, and write the OSD metadata to the beginning of the LV.
raw mode OSD: Write the OSD metadata to the beginning of the device.

Ceph had only lvm mode at first and later introduced raw mode, which is simpler and easier to manage. Starting from v1.6.0, Rook creates raw mode OSDs for OSD on device.

Atari Partition Recognition Method in the Linux Kernel

The Linux kernel recognizes Atari partitions much more leniently than other partition formats. To determine whether the fundamental problem lies in the Linux kernel or in the Atari partition specification itself, I would need to look at the specification, but I could not find it, so the investigation was limited to the kernel source code.

Whether a disk has an Atari partition table is determined by whether one or more valid partition entries are present in the beginning area of the disk. There can be up to 4 partitions¹, and each entry is checked with the VALID_PARTITION() macro.

https://github.com/torvalds/linux/blob/3a93e40326c8f470e71d20b4c42d36767450f38f/block/partitions/atari.c#L53-L70

    rs = read_part_sector(state, 0, &sect);
    if (!rs)
        return -1;

    /* Verify this is an Atari rootsector: */
    hd_size = get_capacity(state->disk);
    if (!VALID_PARTITION(&rs->part[0], hd_size) &&
        !VALID_PARTITION(&rs->part[1], hd_size) &&
        !VALID_PARTITION(&rs->part[2], hd_size) &&
        !VALID_PARTITION(&rs->part[3], hd_size)) {
        /*
         * if there's no valid primary partition, assume that no Atari
         * format partition table (there's no reliable magic or the like
             * :-()
         */
        put_dev_sector(sect);
        return 0;
    }

According to the comment, the Atari format has no magic number or similar marker of the kind that other partition table formats naturally have.

The definition of VALID_PARTITION() is as follows.

https://github.com/torvalds/linux/blob/3a93e40326c8f470e71d20b4c42d36767450f38f/block/partitions/atari.c#L19-L25

/* check if a partition entry looks valid -- Atari format is assumed if at
   least one of the primary entries is ok this way */
#define VALID_PARTITION(pi,hdsiz)                        \
    (((pi)->flg & 1) &&                              \
     isalnum((pi)->id[0]) && isalnum((pi)->id[1]) && isalnum((pi)->id[2]) && \
     be32_to_cpu((pi)->st) <= (hdsiz) &&                     \
     be32_to_cpu((pi)->st) + be32_to_cpu((pi)->siz) <= (hdsiz))

It is quite frightening that a disk can be mistakenly recognized as having an Atari partition table just by meeting such loose conditions.
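To get a feel for how loose this check is, the following is a rough Python sketch of the same test applied to a 512-byte root sector. It is not the kernel code: the offset 0x1C6 of the four primary entries is my reading of the kernel's struct rootsector layout and should be treated as an assumption, the ASCII-only isalnum() is a simplification of the kernel's isalnum(), and the 32-bit mask on st + siz is an assumption meant to mirror 32-bit arithmetic in C.

```python
import struct

# Assumed offset of part[0..3] within the 512-byte root sector.
PART_TABLE_OFFSET = 0x1C6
# One entry: flg (1 byte), id (3 bytes), st and siz (big-endian 32-bit).
ENTRY = struct.Struct(">B3sII")


def valid_partition(entry, hd_size):
    """Sketch of VALID_PARTITION(): `entry` is 12 bytes, `hd_size` is in sectors."""
    flg, ident, st, siz = ENTRY.unpack(entry)
    end = (st + siz) & 0xFFFFFFFF  # assumption: mimic 32-bit wraparound
    return bool(flg & 1) and ident.isalnum() and st <= hd_size and end <= hd_size


def looks_like_atari(sector, hd_size):
    """Return True if at least one of the four primary entries looks valid."""
    for i in range(4):
        start = PART_TABLE_OFFSET + i * ENTRY.size
        if valid_partition(sector[start:start + ENTRY.size], hd_size):
            return True
    return False


if __name__ == "__main__":
    sector = bytearray(512)
    # Twelve plausible-looking bytes are enough to pass the check.
    sector[PART_TABLE_OFFSET:PART_TABLE_OFFSET + ENTRY.size] = ENTRY.pack(0x01, b"BGM", 2, 100)
    print(looks_like_atari(bytes(sector), 1000))
    print(looks_like_atari(bytes(512), 1000))
```

In this sketch, any data whose bytes at the assumed offset happen to have the lowest bit of the flag set, three alphanumeric characters, and small enough numbers would be treated as a partition entry.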

The process leading to the problem

Let's consider the case of trying to create an OSD on /dev/sdb again. For simplicity, let's assume that the Rook configuration is set to create OSDs on all available devices.

First, Rook operates and creates an OSD on /dev/sdb. So far, so good. From v1.6.0 to v1.6.7, the OSD metadata is written to the beginning of the disk to create a raw mode OSD. Unfortunately, this OSD metadata has a bit pattern that is easily mistaken for an Atari Partition Table².

You can check the devices where the misrecognition occurred from the results of the lsblk command as follows.

vdb    252:16   0    3T  0 disk 
├─vdb2 252:18   0   48G  0 part # phantom Atari Partition 
└─vdb3 252:19   0  6.1M  0 part # same as above

Interestingly, tools such as lsblk, blkid, udevadm, and parted have confused users and developers because they do not recognize Atari partitions: they either make it appear as if vdb2 and vdb3 do not exist or report the partition table type as unknown. This is why the misrecognized partitions are called "phantom" partitions.

After this, when Rook runs again for whatever reason, it executes the ceph-volume command, which mistakenly recognizes an Atari partition table, along with some phantom partitions, on the disk (/dev/vdb in the lsblk output above) instead of an OSD. Let's assume that /dev/vdb2 is one of the phantom partitions. As a result of this misrecognition, /dev/vdb is reported to be in use, while /dev/vdb2 is reported to be empty and available for creating an OSD.

Finally, because the user has instructed Rook to create OSDs on free devices, it creates a new OSD on /dev/vdb2. This partially destroys the data of the original OSD on /dev/vdb.

History of Handling the Issue

There are three parties involved in this issue: Rook, Ceph, and the Linux Kernel. In fixing such issues, it is important to consider "which layer can fix it" and "which layer should fix it."

Various workarounds were proposed after many twists and turns, but it was determined that not much could be done at a superficial level. In Rook, a fix to create lvm mode OSDs when creating an OSD on disk has been incorporated.

In addition, a fix for ceph-volume to ignore phantom Atari partitions was made in Ceph and included in the v16.2.6 release. Since v16.2.6 was released, Rook has again used raw mode for OSD on disk when running with that version.

As for the Linux kernel, a fix to disable Atari Partition support in the kernel for major cloud vendor environments in Ubuntu has been made. It is not written what problem prompted the fix, so it may have come up in a context unrelated to Rook or Ceph.

Conclusion

It was quite a dramatic issue, but I think it's a good example to learn how to fix software and make announcements in such cases. If you want to learn more, I recommend digging deeper by following related issues.

This may be a limitation of Linux, but it is unclear. ↩
Fortunately, there have been no reports of OSDs being mistaken for Atari Partitions when using lvm mode OSDs. However, this is simply because the metadata of the VG is at the beginning of the disk in the case of lvm mode, and the pattern just happens to not be recognized as an Atari Partition. ↩

How Linux Works: Chapter1 Linux Overview (Part2)

Satoru Takeuchi — Sun, 26 Mar 2023 07:26:33 +0000

Libraries

In this section, we will discuss libraries provided by the operating system. Many programming languages offer the ability to bundle commonly used functions across multiple programs into libraries. This allows programmers to efficiently develop programs by choosing from a vast array of libraries created by their predecessors. Some libraries, which are expected to be used by a large number of programs, may be provided by the operating system.

The following figure shows the software hierarchy when a process is using a library.

The C language has a standard library defined by the International Organization for Standardization (ISO). Linux also provides this standard C library. Typically, glibc, provided by the GNU project, is used as the standard C library. In this book, we will refer to glibc as libc.

Almost all programs written in C are linked with libc.

You can use the ldd command to check which libraries a program is linked with. Let's take a look at the ldd output for the echo command.

$ ldd /bin/echo
        linux-vdso.so.1 (0x00007ffef73a9000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2925ebd000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f29260d1000)
$

In the above example, libc.so.6 refers to the standard C library. Also, ld-linux-x86-64.so.2 is a special library for loading shared libraries, which is also one of the libraries provided by the OS.

Let's also check the cat command.

$ ldd /bin/cat
        linux-vdso.so.1 (0x00007ffc3b155000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fabd1194000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fabd13a9000)
$

This also links to libc. Let's also look at the python3 command, which is the Python3 interpreter.

$ ldd /usr/bin/python3
        linux-vdso.so.1 (0x00007ffc91126000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5fb7206000)
 ...
        /lib64/ld-linux-x86-64.so.2 (0x00007f5fb740f000)
$

Again, libc is linked. In other words, when executing Python programs, the standard C library is used internally. Although few people may use C language directly nowadays, it can be seen that it remains an important language as the backbone of the OS level.

If you run the ldd command for various programs existing in the system, you will see that many of them are linked with libc. Please give it a try.
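If you want to try this on many programs at once, a small script can help. The following is a minimal sketch that parses ldd output and checks whether libc appears in it. The helper names linked_libraries() and uses_libc() and the list of paths are hypothetical, and the parsing assumes output in the format shown above.

```python
#!/usr/bin/python3

import subprocess


def linked_libraries(ldd_output):
    """Return the first field of each line of ldd output (the library name)."""
    libs = []
    for line in ldd_output.splitlines():
        fields = line.split()
        # Skip empty lines and the "not a dynamic executable" message.
        if fields and fields[0] != "not":
            libs.append(fields[0])
    return libs


def uses_libc(ldd_output):
    """Return True if a library named libc.so.* is listed."""
    return any(name.startswith("libc.so") for name in linked_libraries(ldd_output))


if __name__ == "__main__":
    for path in ["/bin/echo", "/bin/cat", "/usr/bin/python3"]:
        result = subprocess.run(["ldd", path], capture_output=True, text=True)
        print(path, uses_libc(result.stdout))
```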

In addition to this, Linux provides standard libraries for various programming languages, such as C++. It also offers libraries that, while not standard, many programmers are likely to use. In Ubuntu, the names of library packages often begin with the string "lib". When I ran dpkg-query -W | grep lib in my environment, over 1000 packages were displayed.

Wrapper Functions for System Calls

libc not only provides the standard C library but also offers something called "wrapper functions" for system calls. System calls cannot be directly called from high-level languages such as C, unlike regular function calls. They must be invoked using architecture-dependent assembly code.

For example, in the x86_64 CPU architecture, the getppid() system call is issued at the assembly code level as follows:

mov    $0x6e,%eax
syscall

In the first line, the system call number "0x6e" for getppid() is assigned to the eax register. This is determined by the Linux system call calling convention. The second line issues the system call and transitions to kernel mode via the syscall instruction. After this, the kernel code that processes getppid() is executed. If you don't usually write assembly language, you don't need to understand the detailed meaning of this source here. Just get a feel for the atmosphere that it's obviously different from the source code you normally see.

In the arm64 architecture, which is mainly used in smartphones and tablets, the getppid() system call is issued at the assembly code level as follows:

mov     x8,  <system call number>
svc     #0

Quite different, isn't it? Without the help of libc, every time you issue a system call, you would have to write architecture-dependent assembly source code and call it from a high-level language.

This would make program creation more time-consuming and not portable to other architectures.

To solve such problems, libc provides a series of functions called "wrapper functions" for system calls, which internally just call the system calls. Wrapper functions exist for each architecture. From user programs written in high-level languages, you only need to call the system call wrapper functions prepared for each language.
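The idea of a wrapper function can be tried from Python as well. The sketch below calls libc's getppid() wrapper through ctypes and compares it with Python's own os.getppid(). It assumes a typical Linux system where ctypes.CDLL(None) exposes libc's symbols. The second helper uses libc's generic syscall() function with the number 0x6e, so it is only meaningful on x86_64. Both helper names are hypothetical.

```python
#!/usr/bin/python3

import ctypes
import os
import platform

# CDLL(None) looks up symbols already loaded into this process. On a typical
# Linux system that includes libc (an assumption of this sketch).
libc = ctypes.CDLL(None, use_errno=True)


def getppid_via_wrapper():
    """Call libc's getppid() wrapper function directly."""
    return libc.getppid()


def getppid_via_syscall_number():
    """Call libc's generic syscall() with the x86_64 number for getppid()."""
    libc.syscall.argtypes = [ctypes.c_long]
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(0x6e)


if __name__ == "__main__":
    print(os.getppid(), getppid_via_wrapper())
    if platform.machine() == "x86_64":
        print(getppid_via_syscall_number())
```

Note that neither helper contains assembly code: the architecture-dependent part stays inside libc.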

Static Libraries and Shared Libraries

Libraries can be classified into two types: static libraries and shared (or dynamic) libraries. Both provide the same functionality, but the way they are incorporated into a program is different.

When creating a program, first, you compile the source code to create a file called an object file. Then, you link the library used by the object file to create the executable file. At link time, static libraries incorporate the functions within the library into the program. In contrast, shared libraries only embed information such as "call this function of this library" in the executable file at link time. Then, at program startup or during execution, the library is loaded into memory, and the program calls the functions within it.

The following figure shows the difference between the two in the case of a pause program that only calls the pause() system call and does nothing else.

And here is the source code of pause.

#include <unistd.h>

int main(void) {
    pause();
    return 0;
}

Let's verify if my explanation is correct with the following perspectives:

The size of pause program
Link status with shared libraries

As an example, let's consider linking the libc library to the program. First, let's check the case of using the static library "libc.a"¹.

$ cc -static -o pause pause.c
$ ls -l pause
-rwxrwxr-x 1 sat sat 871688  Feb 27 10:29 pause  ... (1)
$ ldd pause
        not a dynamic executable   ... (2)
$

The execution results show the following:

(1) The program size is just under 900KB
(2) No shared libraries are linked

Since this program already incorporates libc, it will still work if "libc.a" is deleted. However, doing so would be very dangerous because other programs would no longer be able to statically link with libc, so please do not do this.

Next, let's consider the case of using the shared library "libc.so"².

$ cc -o pause pause.c
$ ls -l pause
-rwxrwxr-x 1 sat sat 16696  Feb 27 10:43 pause
$ ldd pause
        linux-vdso.so.1 (0x00007ffc18a75000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f64ad4e9000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f64ad6f7000)
$

From these results, we can see the following:

The size is about 16KB, which is a fraction of the size when libc is statically linked.
libc ("/lib/x86_64-linux-gnu/libc.so.6") is dynamically linked.

The pause command with dynamically linked libc will not execute if libc.so is deleted. In fact, doing so is even more dangerous than deleting libc.a, as it would render all programs that link to libc.so inoperable. If this happens, you'll need to use complex methods to recover or reinstall the entire OS. Please do not do this under any circumstances.

The reason for the small size is that libc is not embedded in the program itself but is loaded into memory at runtime. Instead of using separate copies of libc code for each program, all programs using libc share the same instance.

Both static and shared libraries have their pros and cons, so it's hard to say which is better overall. However, shared libraries have been mainly used for the following reasons:

They keep the overall storage consumption low.
If there's an issue with the library, replacing it with a fixed version of the shared library resolves the problem for all programs using that library.

It might be interesting to run the ldd command on the executable files of the programs you use to see which shared libraries are linked.

Column: The Revival of Static Linking

In this article, I mentioned that shared libraries have been preferred, but the situation has changed slightly in recent years. For example, the Go language, which has gained popularity in the past few years, statically links most libraries by default. As a result, most Go programs do not depend on any shared libraries.

Let's run ldd on the hello program, which is written in Go, to verify this.

$ ldd hello
        not a dynamic executable

There are various reasons for this, such as:

The size issue has become relatively smaller thanks to the large capacity of memory and storage in modern computers.
If a program can run with just a single executable file, it is easier to handle since you can simply copy the file to run in another environment.
Faster startup as there is no need to link shared libraries at runtime.
Shared libraries have issues, such as some programs not working due to library version upgrades, because the behavior of different versions of libraries that should originally work the same can be subtly different (so called "DLL Hell").

There are various ways of thinking, and the appropriate method changes over time.


NOTE

This article is based on my book written in Japanese. Please contact me via satoru.takeuchi@gmail.com if you're interested in publishing this book's English version.

In Ubuntu 20.04, this is provided by the libc6-dev package. ↩
In Ubuntu 20.04, this is provided by the libc6 package. ↩

Should we always use quicksort rather than insertion sort?

Satoru Takeuchi — Sun, 26 Mar 2023 06:28:13 +0000

It's often said that you should always use O(n*log(n)) algorithms. However, depending on the situation, that might not be the right choice. In this article, I will compare quicksort and insertion sort.

When written in Big O notation, the computational complexity of quicksort is O(n*log(n)), while that of insertion sort is O(n^2). So, it seems like the former would be faster at a glance. However, this notation only describes the behavior when n approaches infinity, so in real-world programs, quicksort is not always faster.

Let's compare the speeds of both algorithms in a program with the following specifications:

Sort an array with a given number of integer elements and output the time it takes to do so.
First argument (len): Number of elements.
Second argument (type): Type of algorithm. 'i' for insertion sort, 'q' for quicksort.

The following sort program implements these specifications.

#include <time.h>
#include <stdio.h>
#include <stdlib.h>

#define NSECS_PER_MSEC 1000000UL
#define NSECS_PER_SEC 1000000000UL

static void insertion_sort(int *a, int len)
{
    int i, tmp;
    for (i = 1; i < len; i++) {
        tmp = a[i];
        if (a[i - 1] > tmp) {
            int j;
            j = i;
            do {
                a[j] = a[j - 1];
                j--;
            } while (j > 0 && a[j - 1] > tmp);
            a[j] = tmp;
        }
    }
}

int pivot(int a[], int l, int r){
    int i = l, j = r-1;
    int p = a[r];
    int tmp;
    for (;;) {
        while (a[i] < p)
            i++;
        while (i < j && p < a[j])
            j--;
        if (i >= j)
            break;
        tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
    tmp = a[i];
    a[i] = a[r];
    a[r] = tmp;

    return i;
}

void quick_sort_inner(int *a, int l, int r){
    if (l >= r)
        return;
    int v = pivot(a, l, r);
    quick_sort_inner(a, l, v-1);
    quick_sort_inner(a, v+1, r);
}

static void quick_sort(int *a, int len)
{
    quick_sort_inner(a, 0, len-1);
}

static inline long diff_nsec(struct timespec before, struct timespec after)
{
    return ((after.tv_sec * NSECS_PER_SEC + after.tv_nsec)
        - (before.tv_sec * NSECS_PER_SEC + before.tv_nsec));

}

static void prepare_data(int *a, int len)
{
    int i;
    for (i = 0; i < len; i++)
        a[i] = rand();
}

static int comp(const void *a, const void *b)
{
    return *((int *)a) - *((int *)b);
}

static char *progname;

int main(int argc, char *argv[])
{
    progname = argv[0];

    if (argc < 3) {
        fprintf(stderr, "usage: %s <len> <i|q>\n", progname);
        exit(EXIT_FAILURE);
    }

    int len = atoi(argv[1]);

    char type = argv[2][0];

    if (!((type == 'i') || (type == 'q'))) {
        fprintf(stderr, "%s: type should be 'i or q'\n", progname);
        exit(EXIT_FAILURE);
    }

    int *a;
    a = malloc(len * sizeof(int));
    prepare_data(a, len);

    struct timespec before, after;
    if (type == 'i') {
        clock_gettime(CLOCK_MONOTONIC, &before);
        insertion_sort(a, len);
        clock_gettime(CLOCK_MONOTONIC, &after);
    } else {
        clock_gettime(CLOCK_MONOTONIC, &before);
        qsort(a, len, sizeof(int), comp);
        clock_gettime(CLOCK_MONOTONIC, &after);
    }

    printf("%ld\n", diff_nsec(before, after));

    exit(EXIT_SUCCESS);
}

Let's build this program.

$ CFLAGS=-O3 make sort
$

I ran this program with the following parameters.

Parameter	Value
First argument (`len`)	2^(0, 1, 2, ..., 15)
Second argument (`type`)	"i", "q"

I plotted the results on the following graph. Note that the x-axis shows len as a power of 2 and the y-axis is log(elapsed time).

As len gets larger, quicksort indeed becomes faster, and the difference only grows. However, when len is small, insertion sort is faster, up to around 2^8 (=256). This is because quicksort performs more complex processing than insertion sort. Therefore, using quicksort is generally not a bad choice, but if you want to fine-tune your program and know that len will not be that large, switching to insertion sort for speed is one option. However, as the warning against premature optimization suggests, there is no need to start with this kind of fine-tuning.

One last point. For the reasons mentioned above, the qsort() function in glibc internally uses insertion sort when len is small (<= MAX_THRESH). Many other implementations work this way too. If you're interested, it's a good idea to take a look at various quicksort implementations.
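The following is a minimal Python sketch of one common way to combine the two algorithms: quicksort recursion that hands small ranges over to insertion sort. It is not glibc's actual code. The CUTOFF value is a hypothetical threshold chosen for illustration (it is not MAX_THRESH), and the random pivot and simple partition scheme are my own choices. It is also not tuned for inputs with many duplicate values.

```python
import random

CUTOFF = 16  # hypothetical threshold for this sketch, not glibc's MAX_THRESH


def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] (inclusive) in place."""
    for i in range(lo + 1, hi + 1):
        tmp = a[i]
        j = i
        while j > lo and a[j - 1] > tmp:
            a[j] = a[j - 1]
            j -= 1
        a[j] = tmp


def hybrid_quick_sort(a, lo=0, hi=None):
    """Quicksort that switches to insertion sort for small ranges."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= CUTOFF:
        insertion_sort(a, lo, hi)
        return
    # Move a randomly chosen pivot to the end, then partition around it.
    p = random.randint(lo, hi)
    a[p], a[hi] = a[hi], a[p]
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    hybrid_quick_sort(a, lo, i - 1)
    hybrid_quick_sort(a, i + 1, hi)


if __name__ == "__main__":
    data = [random.randrange(1000000) for _ in range(1000)]
    hybrid_quick_sort(data)
    print(data == sorted(data))
```

Where the best cutoff lies depends on the environment, so it would have to be measured in the same way as in the experiment above.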

How Linux Works: Chapter1 Linux Overview (Part1)

Satoru Takeuchi — Sat, 25 Mar 2023 17:20:00 +0000

In this chapter, we will discuss what Linux and its component, the kernel, are, as well as the differences between Linux and other systems within the context of the entire system. We will also explain the meanings of words like programs and processes, which tend to be used in the same context.

Programs and Processes

Various programs run on Linux. A program is a set of instructions and data that work together on a computer. In compiled languages like Go, the executable file produced by building the source code is considered a program. In scripting languages like Python, the source code itself is considered a program. The kernel is also a type of program.

When you turn on the machine, the kernel starts first¹. All other programs start after the kernel.

There are various types of programs running on Linux, such as:

Web browsers: Chrome, Firefox, etc.
Office suites: LibreOffice, etc.
Web servers: Apache, Nginx, etc.
Text editors: Vim, Emacs, etc.
Programming language processors: C compiler, Go compiler, Python interpreter, etc.
Shells: bash, zsh, etc.
System-wide management software: systemd, etc.

A program that is running after startup is called a process. Since the term "program" is sometimes used to refer to a running process, it can be said that "program" has a broader meaning than "process."

Kernel

In this section, we will explain what a kernel is and why it is necessary, using the example of accessing storage devices like HDDs and SSDs connected to the system.

First, let's consider a system where processes can directly access storage devices.

In this case, problems arise when multiple processes try to operate on the device simultaneously.

To read and write data from storage devices, suppose you need to issue the following two commands:

1. Specify the location to read or write data.
2. Read or write data from the location specified in command 1.

If process A writes data while process B reads data from another location at the same time, the commands might be issued in the following order:

1. Process A specifies the location to write data.
2. Process B specifies the location to read data.
3. Process A writes data.

In command 3, the data is written not to the location specified in command 1 but to the one specified in command 2, so the data at the location specified in command 2 is corrupted. As you can see, accessing storage devices is very dangerous if the order of command execution is not properly controlled².
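The interleaving above can be reproduced with a toy model. In the following sketch, the Device class and its methods are hypothetical stand-ins for the two commands. No real device works exactly like this.

```python
class Device:
    """Toy storage device driven by two separate commands."""

    def __init__(self, size):
        self.data = ["-"] * size
        self.location = 0

    def set_location(self, location):
        # First command: specify the location to read or write.
        self.location = location

    def write(self, value):
        # Second command (write case): write at the current location.
        self.data[self.location] = value

    def read(self):
        # Second command (read case): read from the current location.
        return self.data[self.location]


def interleaved_access():
    dev = Device(8)
    dev.set_location(2)  # process A specifies where it wants to write
    dev.set_location(5)  # process B specifies where it wants to read
    dev.write("A")       # process A writes, but the location is now 5
    return dev.data


if __name__ == "__main__":
    print(interleaved_access())
```

Process A's data ends up at location 5, which process B specified, while location 2 is left untouched.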

In addition to this, there are problems where programs that should not have access to the device can access it.

To avoid these problems, the kernel, with the help of hardware, prevents processes from directly accessing devices. Specifically, it uses a feature called mode built into the CPU.

General-purpose CPUs used in personal computers and servers have two modes: kernel mode and user mode. More precisely, some CPU architectures have three or more modes, but we will omit them here³. When a process is running in user mode, it is said to be running in userland (or user space).

While there are no restrictions in kernel mode, certain instructions cannot be executed when running in user mode.

In Linux, only the kernel operates in kernel mode and can access devices. In contrast, processes operate in user mode and cannot access devices. Therefore, processes access devices indirectly through the kernel.

The functionality of accessing devices, including storage devices, through the kernel is described in detail in Chapter 6.

In addition to the device control mentioned above, the kernel centrally manages resources shared by all processes in the system and allocates them to processes running on the system. The program that operates in kernel mode for this purpose is the kernel itself.

System Calls

System calls are a method for processes to request the kernel to perform tasks. They are used when the kernel's help is needed, such as creating new processes or operating hardware.

Examples of system calls include:

Creating and deleting processes
Allocating and deallocating memory
Communication processing
File system operations
Device operations

System calls are implemented by executing special instructions on the CPU. As previously mentioned, processes run in user mode, but when a system call is issued to request the kernel for processing, an event called an exception occurs on the CPU (exceptions are explained in the "Page Tables" section of Chapter 4). Triggered by this event, the CPU transitions from user mode to kernel mode, and the kernel processing corresponding to the request begins. Once the system call processing within the kernel is complete, the CPU returns to user mode and the process continues its operation.

At the beginning of the system call processing, the kernel checks whether the request from the process is legitimate (e.g., whether it is not requesting an amount of memory that does not exist in the system). If the request is illegitimate, the system call fails.

There is no way for a process to change the CPU mode directly without going through a system call. If there were, there would be no point in having a kernel. For example, if a malicious user were to change the CPU to kernel mode from a process and directly operate a device, they could eavesdrop on or destroy other users' data.

Visualizing System Call Invocations

You can check what system calls a process issues using the strace command. Let's try running a simple hello program that only outputs the string "hello world" through strace.



package main

import (
    "fmt"
)

func main() {
    fmt.Println("hello world")
}

First, build and run the program without strace.



$ go build hello.go
$ ./hello
hello world

As expected, it displayed "hello world". Now let's see what system calls this program issues using strace. You can specify the output destination of strace with the "-o" option.



$ strace -o hello.log ./hello
hello world

The program terminated with the same output as before. Now let's take a look at the contents of "hello.log", which contains the output of strace.



$ cat hello.log
...
write(1, "hello world\n", 12)           = 12 ... (1)
...

The output of strace corresponds to one system call per line. You can ignore the detailed numerical values and just look at the string at the beginning of each line. From line (1), you can see that the program is using the write() system call to output the string "hello world\n" (where \n represents a newline character) to the screen or a file.

In my environment, a total of 150 system calls were issued. Most of these were issued by the program's startup and shutdown processing (which is also provided by the operating system) and are not something you need to worry about.
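Counting the lines by hand quickly becomes tedious, so here is a minimal sketch that counts system calls per name in a log written by strace -o. The function name count_syscalls() is hypothetical. In the sample text, only the write() line comes from the output above, and the other lines are made up to illustrate the format.

```python
#!/usr/bin/python3

import re
from collections import Counter

# A system call line starts with the call name followed by "(".
SYSCALL_LINE = re.compile(r"^(\w+)\(")


def count_syscalls(log_text):
    """Count system calls per name in the text of an strace log."""
    counts = Counter()
    for line in log_text.splitlines():
        m = SYSCALL_LINE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts


SAMPLE = r"""execve("./hello", ["./hello"], 0x7ffc00000000 /* 20 vars */) = 0
write(1, "hello world\n", 12)           = 12
exit_group(0)                           = ?
+++ exited with 0 +++
"""

if __name__ == "__main__":
    counts = count_syscalls(SAMPLE)
    for name, n in counts.most_common():
        print(name, n)
    print("total", sum(counts.values()))
```

To use it on a real log, read the file first, for example count_syscalls(open("hello.log").read()).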

Regardless of the programming language used, when a program requests processing from the kernel, it issues a system call. Let's confirm this by examining the hello.py program shown below, a Python program that does the same thing as the hello program.



#!/usr/bin/python3

print("hello world")

Let's run this hello.py program through strace.



$ strace -o hello.py.log ./hello.py
hello world

Let's take a look at the trace information.



$ cat hello.py.log
...
write(1, "hello world\n", 12)           = 12   ... (2)
...

Looking at (2), you can see that, just like the hello program, the write() system call is issued. Try writing a hello program equivalent in your favorite language and experiment with various things. Also, it might be interesting to run more complex programs through strace. However, be aware that the output of strace tends to be large, so be careful not to exhaust your file system capacity.

Proportion of time spent processing system calls

You can use the sar command to find out what proportion of time the logical CPUs⁴ installed in the system spend on each type of processing. First, let's collect information on what kind of processing logical CPU 0 is executing using the sar -P 0 1 1 command. The "-P 0" option means to collect data for logical CPU 0, the next "1" means to collect every second, and the last "1" means to collect data only once.



$ sar -P 0 1 1
Linux 5.4.0-66-generic (coffee)         3/26/23  _x86_64_        (8 CPU)

09:51:03     CPU     %user     %nice   %system   %iowait    %steal     %idle ... (1)
09:51:04       0      0.00      0.00      0.00      0.00      0.00    100.00
Average:          0      0.00      0.00      0.00      0.00      0.00    100.00

Let me explain how to read this output. (1) is the header line. The line after it shows how the logical CPU indicated in its second field was used during the interval from the time in the first field of the header line ("09:51:03") to the time in the first field of that line ("09:51:04").

CPU usage is divided into six categories, from the third field ("%user") to the eighth field ("%idle"). Each is expressed as a percentage, and their sum equals 100. The proportion of time spent executing processes in user mode is the sum of "%user" and "%nice" (the difference between "%user" and "%nice" is described in the "Time slice mechanism" column in Chapter 3). "%system" is the proportion of time the kernel spent processing system calls, and "%idle" is the proportion of time spent in the idle state, when nothing is done. We will omit the others here.

In the output above, "%idle" was "100.00". This means that the CPU was doing almost nothing.
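
To make the percentage arithmetic concrete, here is a minimal sketch that turns two samples of per-CPU tick counters into percentages. It assumes the counters come from a "cpuN" line of /proc/stat, whose first eight values are user, nice, system, idle, iowait, irq, softirq, and steal. The function names are hypothetical, and this is not a reimplementation of sar, which reports fewer columns by default.

```python
#!/usr/bin/python3

# Order of the first eight counters on a "cpuN" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a line such as 'cpu0 100 0 50 800 ...' into a dict of tick counters."""
    parts = line.split()
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def usage_percent(before, after):
    """Return each category's share (in percent) of the ticks elapsed between samples."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * ticks / total for name, ticks in delta.items()}


def read_cpu_line(cpu="cpu0", path="/proc/stat"):
    """Fetch the current counter line for one logical CPU (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.split()[0] == cpu:
                return line
    raise LookupError(f"{cpu} not found in {path}")


if __name__ == "__main__":
    # Synthetic samples: 35 user ticks and 65 system ticks elapsed in between.
    first = parse_cpu_line("cpu0 100 0 50 800 0 0 0 0 0 0")
    second = parse_cpu_line("cpu0 135 0 115 800 0 0 0 0 0 0")
    for name, pct in usage_percent(first, second).items():
        print(f"%{name:<8} {pct:6.2f}")
```

For live data, call read_cpu_line() twice with a pause in between and pass the two parsed results to usage_percent().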

Now, let's look at the output of sar while running the inf-loop.py program, which only does an infinite loop, in the background.



#!/usr/bin/python3

while True:
    pass

We will use the taskset command provided by the OS to run the inf-loop.py program on logical CPU 0. By executing taskset -c <logical CPU number> <command>, you can run <command> on the logical CPU specified by the "-c" option. While running this command in the background, let's collect statistical information using the sar -P 0 1 1 command.



$ taskset -c 0 ./inf-loop.py &
[1] 1911
$ sar -P 0 1 1
Linux 5.4.0-66-generic (coffee)         02/27/2021  _x86_64_        (8 CPU)

09:59:57     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:59:58       0    100.00      0.00      0.00      0.00      0.00      0.00  ... (1)
Average:          0    100.00      0.00      0.00      0.00      0.00      0.00

From (1), we can see that "%user" was "100.00" because the inf-loop.py program was constantly running on logical CPU 0. The state of logical CPU 0 at this time is shown below.

When the experiment is over, terminate the inf-loop.py program with kill.



$ kill 1911
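
The pinning that taskset performs from the outside can also be requested from inside a program. The sketch below uses os.sched_setaffinity(), which is available in Python on Linux. The helper name pin_to_one_cpu is hypothetical, and choosing the lowest-numbered allowed CPU by default is just a convenience for environments where logical CPU 0 is not available to the process.

```python
#!/usr/bin/python3
import os


def pin_to_one_cpu(cpu=None):
    """Restrict the calling process to one logical CPU and return its number."""
    allowed = os.sched_getaffinity(0)  # 0 means "the calling process"
    if cpu is None:
        cpu = min(allowed)  # lowest-numbered logical CPU we are allowed to use
    if cpu not in allowed:
        raise ValueError(f"logical CPU {cpu} is not available to this process")
    os.sched_setaffinity(0, {cpu})
    return cpu


if __name__ == "__main__":
    chosen = pin_to_one_cpu()
    print(f"now restricted to logical CPU {chosen}: {os.sched_getaffinity(0)}")
```

Calling pin_to_one_cpu(0) at the top of inf-loop.py should have roughly the same effect as launching it with taskset -c 0, assuming logical CPU 0 is in the allowed set.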

Next, let's do the same thing with the syscall-inf-loop.py program, which continuously issues the simple system call getppid() to get the parent process's process ID.



#!/usr/bin/python3

import os

while True:
    os.getppid()



$ taskset -c 0 ./syscall-inf-loop.py &
[1] 2005
$ sar -P 0 1 1
Linux 5.4.0-66-generic (coffee)         02/27/2021  _x86_64_        (8 CPU)

10:03:58     CPU     %user     %nice   %system   %iowait    %steal     %idle
10:03:59       0     35.00      0.00     65.00      0.00      0.00      0.00  ... (1)
Average:          0     35.00      0.00     65.00      0.00      0.00      0.00

This time, because the system call is constantly being issued, "%system" has increased. The state of the CPU at this time is as follows.

When the experiment is over, terminate the syscall-inf-loop.py program with kill, just as before.
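
The same split between user mode and kernel mode can be observed from inside a process. The following sketch repeats the getppid() loop a fixed number of times and uses os.times() to report how much CPU time the process spent in user mode and in the kernel. The function name and the iteration count are arbitrary choices for illustration, and the resulting ratio will vary by machine and Python version.

```python
#!/usr/bin/python3
import os


def cpu_time_split(iterations=200_000):
    """Call getppid() repeatedly; return (user_seconds, system_seconds) consumed."""
    start = os.times()
    for _ in range(iterations):
        os.getppid()  # one system call per iteration
    end = os.times()
    return end.user - start.user, end.system - start.system


if __name__ == "__main__":
    user, system = cpu_time_split()
    print(f"user={user:.2f}s system={system:.2f}s")
```

Replacing os.getppid() with pass should shrink the system time to nearly zero, mirroring the difference between the two sar outputs above.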

Column: Monitoring, Alerting, and Dashboards

As mentioned earlier, collecting system statistical information using tools like the sar command is crucial for ensuring that the system is functioning as expected. In business systems, it is common to continuously collect such statistical information. This kind of mechanism is called monitoring. Nowadays, Prometheus, for example, is one of the attractive options for a monitoring tool.

It's difficult for humans to watch statistical information continuously, so it's common to use an alerting function along with monitoring tools. This function lets humans define in advance what constitutes a normal state and notifies administrators or operators when an anomaly occurs. Alerting tools may be integrated with monitoring tools, but they can also be standalone software, such as Alertmanager.

Ultimately, humans will troubleshoot when the system enters an abnormal state, but examining a list of numbers alone is inefficient for investigation. Therefore, a dashboard feature that visualizes the collected data is also commonly used. This feature can also be integrated with monitoring or alerting tools or used as standalone software, such as Grafana Dashboards.

Duration of System Calls

By adding the "-T" option to strace, you can know how much time was consumed for system calls with microsecond precision. This feature is useful for determining which system calls are taking time when the "%system" is high. The following is the result of running strace -T on the hello program.



$ strace -T -o hello.log ./hello
hello world
$ cat hello.log
...
write(1, "hello world\n", 12)           = 12 <0.000017>
...

In this case, for example, it took 17 microseconds for the process to output the string "hello world\n".

strace also has other options, such as the "-tt" option, which displays the time at which each system call was issued with microsecond precision. Use them as needed.
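
When a log produced by strace -T is long, it helps to sort the calls by duration. The sketch below is one way to do that, assuming each line has the shape shown above (call, " = ", result, and the duration in angle brackets). The function names and the regular expression are my own and are not part of strace.

```python
#!/usr/bin/python3
import re

# Matches lines such as: write(1, "hello world\n", 12)   = 12 <0.000017>
LINE_RE = re.compile(r"^(?P<name>\w+)\(.*\)\s+=\s+.*<(?P<secs>\d+\.\d+)>\s*$")


def parse_duration(line):
    """Return (syscall_name, seconds) for a strace -T line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return m.group("name"), float(m.group("secs"))


def slowest(lines, count=5):
    """Return up to count (name, seconds) pairs, longest duration first."""
    parsed = [p for p in map(parse_duration, lines) if p is not None]
    return sorted(parsed, key=lambda item: item[1], reverse=True)[:count]


if __name__ == "__main__":
    import sys

    # Usage: python3 slowest.py hello.log
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as f:
            for name, secs in slowest(f):
                print(f"{secs:.6f} {name}")
```

Lines without a duration, such as the final exit status line, are simply skipped.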


NOTE

This article is based on my book written in Japanese. Please contact me via satoru.takeuchi@gmail.com if you're interested in publishing this book's English version.

To be precise, programs like firmware and boot loaders run before that. This will be explained in Chapter 2's "Parent-Child Relationship of Processes" section. ↩
In the worst case, the device may break and become unusable. Such a device is commonly called a "brick". ↩
For example, there are four CPU modes in the x86_64 architecture, but the Linux kernel only uses two of them. ↩
This is the unit that the kernel recognizes as a CPU. On a single-core CPU it corresponds to the CPU itself, on a multicore CPU it corresponds to one core, and on a system with SMT enabled (refer to the "Simultaneous Multi-Threading (SMT)" section in Chapter 8) it corresponds to one thread within a CPU core. For simplicity, this book uses the term "logical CPU" for all of these. ↩

Various Things About Command-line Arguments for Linux Processes

Satoru Takeuchi — Sat, 25 Mar 2023 08:42:40 +0000

Introduction

When you execute a command like "foo bar baz", the command-line arguments are typically "foo", "bar", and "baz". Although you might think that the arguments are only "bar" and "baz", this is the definition anyway.

In C and C++, command-line arguments can be referenced from the argv array argument of the main function in the program. In the example above, the executable name is stored in argv[0]. The "bar" after that is in argv[1], and "baz" is in argv[2]. The variable equivalent to argv is "$0","$1","$2"... in shell scripts, sys.argv in Python, and os.Args in Go, etc. However, scripts such as shell scripts and Python scripts do not directly expose command-line arguments like C, but show slightly modified ones. This will be discussed later.

About the first element of command-line arguments

The first element of command-line arguments (hereafter referred to as argv[0]) conventionally contains the name of the executable. When executing a program, regardless of the language in which the program is written, the execve() system call shown below is eventually called, specifying the program's executable name in the pathname argument and the command-line arguments in the argv argument.

int execve(const char *pathname, char *const argv[], char *const envp[]);

At this time, following the convention, the same value is specified for pathname and argv[0].

When would you set argv[0] to something other than the name of the executable? For example, when bash is a login shell, argv[0] is not bash but rather "-bash" with a "-" at the beginning. This allows bash to know whether it is a login shell at the time of execution, and to branch its processing accordingly (such as changing the configuration file to be loaded). We will actually verify the value of argv[0] for bash in the next section.
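
The separation between pathname and argv[0] can be tried out from Python. In the sketch below, subprocess.run() receives the argument list (whose first element becomes argv[0]) separately from the executable parameter (the file that is actually executed). The helper name run_with_argv0 and the name "my-shell" are made up for illustration, and the example assumes that /bin/sh is a POSIX shell such as dash or bash, where "$0" under "-c" with no further operands expands to argv[0].

```python
#!/usr/bin/python3
import subprocess


def run_with_argv0(argv0, script="echo $0", shell_path="/bin/sh"):
    """Execute shell_path with argv[0] set to argv0 and return what the script prints."""
    result = subprocess.run(
        [argv0, "-c", script],   # this list becomes the new program's argv
        executable=shell_path,   # this is the pathname that is actually executed
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # The shell reports the name it was given, not the path it was loaded from.
    print(run_with_argv0("my-shell"))
```

A name starting with "-" is deliberately avoided here, because, as described above, a shell may then behave as a login shell and read additional configuration files.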

When the file being executed is a script that runs through an interpreter, such as a bash script, argv[0] holds the interpreter's executable name, not the script name. For example, if there is a bash script called "test.sh", when executing "./test.sh", argv[0] is the executable name of bash, and "./test.sh" is stored in argv[1]. However, this is awkward for script authors to handle. So inside a bash script, "$0" refers to argv[1] (the script name), "$1" refers to argv[2], and so on. We will actually verify this in a later section as well.

Verifying the values of a process's command-line arguments using procfs

The command-line arguments of each process can be referenced from /proc/<pid>/cmdline. For example, on the Linux machine where the author is currently logged in via ssh, the command-line arguments for rsyslogd, which collects system logs, were as follows:

sat@tea:~$ pgrep rsyslogd
568
sat@tea:~$ cat /proc/568/cmdline
/usr/sbin/rsyslogd-n-iNONEsat@tea:~

The output looks a bit odd. Option-like strings are attached directly to "/usr/sbin/rsyslogd", which looks like an executable name, and there is no newline before the next prompt. This is not because /proc/<pid>/cmdline is designed to output the arguments without any delimiter such as " ". In fact, each argument is terminated by a null character (a byte with a value of 0, or "\0" in C), and the null character is not displayed on the screen. We can use binary dump tools like hexdump to confirm this.

$ hexdump -c /proc/568/cmdline
0000000   /   u   s   r   /   s   b   i   n   /   r   s   y   s   l   o
0000010   g   d  \0   -   n  \0   -   i   N   O   N   E  \0           
000001d
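
The null-separated layout seen in the hexdump output can be decoded with a few lines of Python. This is a minimal sketch, and the function names parse_cmdline and read_cmdline are hypothetical. It assumes that every argument, including the last one, is followed by a null byte, as in the dump above.

```python
#!/usr/bin/python3


def parse_cmdline(raw):
    """Split the raw bytes of a cmdline file into a list of argument strings."""
    # Every argument ends with a null byte, so drop the final one before splitting.
    if raw.endswith(b"\0"):
        raw = raw[:-1]
    if not raw:
        return []
    return [arg.decode(errors="replace") for arg in raw.split(b"\0")]


def read_cmdline(pid="self"):
    """Read and decode /proc/<pid>/cmdline (Linux only)."""
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return parse_cmdline(f.read())


if __name__ == "__main__":
    # The bytes shown by hexdump for rsyslogd in the text.
    print(parse_cmdline(b"/usr/sbin/rsyslogd\0-n\0-iNONE\0"))
```

Calling read_cmdline() with no argument decodes the command line of the Python process itself.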

Thus, in reality, argv[0] is "/usr/sbin/rsyslogd", argv[1] is "-n", and argv[2] is "-iNONE".

Let's take a look at the argv[0] of bash instances on the system. The last field of the ps ax output shows the command-line arguments separated by spaces, so we'll use this to list the existing bash instances on the system.

sat@tea:~$ ps ax | grep bash
   5239 pts/3    Ss+    0:00 /usr/bin/bash --init-file /home/sat/.vscode-server/bin/74b1f979648cc44d385a2286793c226e611f59e7/out/vs/workbench/contrib/terminal/browser/media/shellIntegration-bash.sh
   8725 pts/4    Ss     0:00 -bash
   8907 pts/4    S      0:00 /bin/bash ./test.sh
   8909 pts/4    S      0:00 /bin/bash ./test.sh
   8929 pts/4    S+     0:00 grep --color=auto bash

We can see that the processes with pid 5239, 8725, 8907, and 8909 are bash instances. Among them, the process with pid=8725, where the first character of argv[0] is "-", is the login shell where the author is running the above commands.

Let's also take a look at an example of a bash script. The script we'll use here outputs "$0" and then sleeps indefinitely.

sat@tea:~$ cat test.sh
#!/bin/bash

echo $0

sleep infinity
sat@tea:~$ ./test.sh
fg
./test.sh # Output of `$0`
^Z
[2]+  Stopped                 ./test.sh
sat@tea:~$ bg
[2]+ ./test.sh &
sat@tea:~$ hexdump -c /proc/8909/cmdline
0000000   /   b   i   n   /   b   a   s   h  \0   .   /   t   e   s   t
0000010   .   s   h  \0                                               
0000014

While the value of "$0" is the script name "./test.sh", the value of argv[0] is "/bin/bash", and the script name is in argv[1]. This means that although it appears to the user as if they are directly executing the "./test.sh" file, the actual program running is bash. Bash interprets and executes the script, mapping argv[1] and beyond to $0 and later variables within the program.
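
Python scripts behave in a similar way: sys.argv starts at the script name, while the kernel-level argument list starts at the interpreter. The sketch below checks this by launching a small child script that prints both views. The child script, its file name show_args.py, and the helper compare_argv are all made up for this illustration, and the example relies on /proc, so it only works on Linux.

```python
#!/usr/bin/python3
import json
import os
import subprocess
import sys
import tempfile

# Child script: report sys.argv next to the raw arguments recorded by the kernel.
CHILD_SOURCE = """\
import json, sys
with open("/proc/self/cmdline", "rb") as f:
    raw = f.read().split(b"\\0")[:-1]
print(json.dumps({"sys_argv": sys.argv, "cmdline": [a.decode() for a in raw]}))
"""


def compare_argv(extra_args=("bar", "baz")):
    """Run the child script and return both views of its command-line arguments."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "show_args.py")
        with open(script, "w") as f:
            f.write(CHILD_SOURCE)
        out = subprocess.run(
            [sys.executable, script, *extra_args],
            capture_output=True, text=True, check=True,
        ).stdout
    return json.loads(out)


if __name__ == "__main__":
    views = compare_argv()
    print("cmdline :", views["cmdline"])
    print("sys.argv:", views["sys_argv"])
```

The cmdline list should contain one more element than sys.argv, namely the interpreter path at the front, which corresponds to "/bin/bash" in the test.sh example.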

Differences between the command name and command line arguments held by the kernel

Please note that the argv[0] mentioned in this article, which "usually contains the executable file name," is different from the command name seen by the kernel, which is displayed by commands like ps. For more information on the command name from the kernel's perspective, please refer to the following article.

https://dev.to/satorutakeuchi/command-name-from-the-perspective-of-the-linux-kernel-257l

The command name seen by the kernel is the "first 15 bytes of the basename of the executable file name" and is different from argv[0].

Conclusion

I hope this article will reduce confusion about command names, executable file names, command line arguments, and the command line arguments that can be referenced from within the program's source code.

"Command name" from the perspective of the Linux kernel.

Satoru Takeuchi — Fri, 24 Mar 2023 00:28:57 +0000

Introduction

Those of you who use Linux probably execute various commands on Linux on a daily basis. You might use the term "command name" to identify these, but depending on the context, the meaning of this term can vary. This article explains what the Linux kernel considers a command name.

First, I will present a brief conclusion, followed by a detailed explanation, and finally, I will describe the motivation for this investigation and the subsequent research process.

TL;DR

From the Linux kernel's perspective, the command name is the first 15 bytes of the basename of the executable file name (the file name without the directory part).
It is stored as a NULL-terminated string in a 16-byte field called comm within a structure called task_struct, which exists for each process in the kernel's memory (more precisely, for each kernel-level thread).
This enables the kernel to identify processes with low cost and higher readability than using a pid.
This command name is used in kernel logs and by commands such as ps and pgrep, which are provided by the procps package. Command names longer than 15 bytes are truncated because of the limit mentioned above.

Investigation Process

Software versions used for the investigation

Linux kernel: v5.15
Procps: 3.3.17

Motivation

The motivation for investigating what was mentioned in the "TL;DR" section came from the fact that the pgrep command I used in my custom program did not work correctly. The pgrep command takes a string specified as an argument as a regular expression and retrieves a list of pids of running processes that match it. For example, below is an example of running an infinitely sleeping script called "foo.sh" and then using pgrep to display its pid.

$ cat foo.sh
#!/bin/bash

sleep infinity
$ ./foo.sh &
[2] 1086408
$ pgrep "foo\.sh"
1086408

However, when I tried the same thing with a script called "foo-bar-baz-hoge-huga.sh" that does exactly the same thing as "foo.sh", pgrep did not display anything.

$ cat foo-bar-baz-hoge-huga.sh
#!/bin/bash

sleep infinity
$ ./foo-bar-baz-hoge-huga.sh &
[2] 1086868
$ pgrep "foo-bar-baz-hoge-huga\.sh"
$

I thought it was odd, so I looked at man pgrep and found the following description.

The process name used for matching is limited to the 15 characters present in the output of /proc/pid/stat.

In fact, when I looked at the /proc/pid/stat file for "foo-bar-baz-hoge-huga.sh", I got the following output.

$ cat /proc/601235/stat
601235 (foo-bar-baz-hog) S 593786 601235 593786 34817 601419 4194304 224 0 0 0 0 0 0 0 20 0 1 0 5735606 8617984 900 18446744073709551615 94266299658240 94266300571405 140732967030208 0 0 0 65536 4 65538 1 0 0 17 1 0 0 0 0 0 94266300816048 94266300864080 94266304847872 140732967036675 140732967036712 140732967036712 140732967038941 0

The string displayed inside the parentheses in the second field, which shows the command name, did indeed match only the first 15 characters of the script name, not the entire name.
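
Extracting the second field programmatically takes a little care. The following sketch looks for the first "(" and the last ")" rather than splitting the line on whitespace. This is a defensive choice on my part, in case a command name itself contains spaces or parentheses. The function name parse_stat is hypothetical.

```python
#!/usr/bin/python3


def parse_stat(line):
    """Return (pid, comm, state) from the contents of a /proc/<pid>/stat file."""
    # Use the first "(" and the last ")" so that unusual characters inside
    # the command name do not shift the fields that follow it.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


if __name__ == "__main__":
    # Beginning of the stat line shown in the text.
    print(parse_stat("601235 (foo-bar-baz-hog) S 593786 601235 593786"))
```

On a Linux machine, passing the contents of /proc/self/stat to parse_stat() shows the command name of the Python interpreter itself.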

Although I understood the specification itself and realized that my usage of pgrep was incorrect, I decided to verify where this 15-character limit came from.

Reading the procfs Manual

The files under the /proc/ directory are provided by a file system called procfs. Unlike file systems such as ext4 or XFS that manage data on disk, procfs exists for users to obtain kernel information and modify the kernel state through files. We will not go into the details of procfs here.

First, let's check the specifications of the /proc/pid/stat file. The specifications of files under procfs are described in man procfs. The following is an excerpt of the relevant part:

/proc/[pid]/stat
Status information about the process. This is used by ps(1). It is defined in the kernel source file fs/proc/array.c.
...
(2) comm %s
The filename of the executable, in parentheses. Strings longer than TASK_COMM_LEN (16) characters (including the terminating null byte) are silently truncated. This is visible
whether or not the executable is swapped out.

We can see that the second field of the /proc/pid/stat file contains the name of the executable file in parentheses, and that any part exceeding 16 bytes, including the terminating null byte, is truncated. Subtracting 1 byte for the null byte from 16 bytes gives us 15 bytes, which matches the information written in the pgrep manual.

Identifying the handler for the `/proc/pid/stat` file

Next, I looked at the kernel source to see where this string is actually being output and where the data is stored. The procfs manual states that the /proc/pid/stat file is defined in the fs/proc/array.c file in the kernel source, so I first looked at this file.

The relevant code seems to be in the following part of the do_task_stat() function:

https://github.com/torvalds/linux/blob/v5.15/fs/proc/array.c#L562-L564

When the seq_puts() function is called, it outputs the specified string to a file. In the code above, lines 562 and 564 output "(" and ")", and it can be inferred that the command name is probably being output to a file by the proc_task_name() function on line 563.

Before looking at the contents of proc_task_name(), I decided to first check whether the do_task_stat() function is actually called when the /proc/pid/stat file is read. I traced the callers of the do_task_stat() function and found that it is called from two functions, proc_tid_stat() and proc_tgid_stat().

https://github.com/torvalds/linux/blob/v5.15/fs/proc/array.c#L646-L656

In the kernel, tid refers to the thread ID, and tgid refers to the thread group ID, which is what users see as the process ID, so we can guess that the proc_tgid_stat() function is probably the caller. procfs also provides files under the /proc/pid/task directory that display the state of each thread, so the proc_tid_stat() function is probably the handler for the /proc/pid/task/tid/stat file.

Tracing further back through the callers of these functions, I found that in the fs/proc/base.c file, which registers the handlers to be called when users read and write files in procfs, the proc_tgid_stat() function is registered to be called when accessing the /proc/tgid/stat file, or in other words, the /proc/<pid>/stat file.

https://github.com/torvalds/linux/blob/v5.15/fs/proc/base.c#L3168-L3202

In summary, I found the following:

The user reads the /proc/pid/stat file
The proc_tgid_stat() function is called
The do_task_stat() function is called
The proc_task_name() function is called to output the command name to the file

Identifying the source of the command name information

Upon examining the implementation of the proc_task_name() function, it looks like this:

https://github.com/torvalds/linux/blob/v5.15/fs/proc/array.c#L99-L112

I will omit the details, but when the process indicated by the pid is a regular program, the evaluation result of the if statement on line 103 is false. This evaluation result is true only in the case of special processes created within the kernel.

Furthermore, since the escape argument of the proc_task_name() function is true when called via the proc_tgid_stat() function, the evaluation result of the if statement on line 108 is true. Therefore, we can see that the data obtained by the __get_task_comm() function (probably a NULL-terminated string) is being used as the output for the /proc/pid/stat file on line 109 within the proc_task_name() function. The seq_escape_str() function on line 109 escapes special characters and spaces, but I will not explain the details here as it is not important for this article.

Now, let's look at the contents of the __get_task_comm() function.

https://github.com/torvalds/linux/blob/v5.15/fs/exec.c#L1209-L1215

We can see that the value of tsk->comm, or more precisely, the value of the comm field of a structure named task_struct, is the source of the command name information. The task_struct structure exists for each thread. Let's take a look at the definition of the task_struct structure.

https://github.com/torvalds/linux/blob/master/include/linux/sched.h#L727-L1063

https://github.com/torvalds/linux/blob/master/include/linux/sched.h#L276-L282

We can see that the comm field is an array of char with a length of 16. The procfs manual also mentioned that the length of TASK_COMM_LEN is 16 bytes.

Confirming where the value of task_struct->comm is set

The __set_task_comm() function sets the value of task_struct->comm:

https://github.com/torvalds/linux/blob/v5.15/fs/exec.c#L1223-L1230

The caller of the __set_task_comm() function is the begin_new_exec() function:

https://github.com/torvalds/linux/blob/v5.15/fs/exec.c#L1238-L1357

This function is called when the execve() system call, which replaces the program running in a process with a new one, is invoked. The bprm->filename contains the name of the executable file corresponding to the process as a NULL-terminated string. Here, we can see that the name of the executable file is processed using the kbasename() function and then saved in task_struct->comm. The kbasename() function, similar to the basename() function in the standard C library, returns a string with the directory part of the file name removed. Therefore, if the executable file name is "./foo.sh", "foo.sh" will be stored in task_struct->comm, and if it's "./foo-bar-baz-hoge-huga.sh", "foo-bar-baz-hog" will be stored. Finally, I understood the definition of the "command name" in the /proc/pid/stat file, or, in other words, as referred to by the Linux kernel.
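
The basename-then-truncate behavior described above can be imitated in user space. The sketch below is only an emulation in Python, not kernel code: os.path.basename() stands in for kbasename(), and a 16-byte buffer stands in for the comm field. The function names are hypothetical.

```python
#!/usr/bin/python3
import os

TASK_COMM_LEN = 16  # size of the comm field, including the terminating null byte


def emulate_comm(filename):
    """Return a 16-byte buffer holding the command name derived from filename."""
    base = os.path.basename(filename).encode()  # stand-in for kbasename()
    buf = bytearray(TASK_COMM_LEN)              # zero-filled buffer
    kept = base[:TASK_COMM_LEN - 1]             # leave room for the null byte
    buf[:len(kept)] = kept
    return bytes(buf)


def command_name(filename):
    """Return the emulated command name as a string, without the null padding."""
    return emulate_comm(filename).split(b"\0", 1)[0].decode(errors="replace")


if __name__ == "__main__":
    for path in ("./foo.sh", "./foo-bar-baz-hoge-huga.sh"):
        print(f"{path} -> {command_name(path)}")
```

The truncation works on bytes rather than characters, so a file name containing multi-byte characters may be cut in the middle of a character.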

Examining the procps source code

Lastly, by reading the procps source code, I found out that the string pgrep matches against is, as described in the man page, the string of at most 15 characters in the second field of the /proc/pid/stat file, excluding the "(" and ")".

Since there is nothing particularly interesting going on there, I will omit the details.

Column: Considering the Definition of Command Names

We now understand that the command name, as referred to by the Linux kernel, is the first 15 bytes of the basename of the executable file. However, why is it processed with the basename, and why is it truncated to a maximum of 15 bytes? The reasons are probably as follows:

To identify processes through kernel logs and other means, it is convenient to have easily accessible information in the form of a string, separate from the process ID (pid). The name of the executable file can be used for this purpose. However, storing the full executable file name in the task_struct structure may consume a large amount of kernel memory and could potentially create a security vulnerability if a malicious user executed a program with an excessively long file name. Therefore, storing the entire file name is not feasible.

One might think that it would be sufficient to look at the value of the executable file name stored in the process memory. However, this is not necessarily true. When accessing the process memory from the kernel, if the relevant memory might be swapped out, it is necessary to swap it back in before reading, which can be cumbersome. Moreover, this approach cannot be used in situations where the system is running out of memory, for instance, when the kernel needs to log the lack of memory. It is not possible to increase memory usage when there is already a shortage.

The reason for using the basename, such as "foo.sh" instead of the file name or full path specified at runtime like "./foo.sh", is likely due to the decision that the basename still provides sufficiently high visibility. In most cases, the basename is enough to recognize and identify the process without using the full path.

Conclusion

In this article, I described why the command name specification in the Linux kernel is as it is. Additionally, I wrote about the process of finding answers to small questions that arise while using a computer by reading source code, so that readers can relive the experience of source code reading. Neither of these provides immediately useful knowledge, but I hope they can serve as tidbits of information.
