
Satoru Takeuchi


How Linux Works: Chapter 3 Process Scheduler (Part 2)

Time Slice

In the previous section, we found out that the number of processes that can run simultaneously on one CPU is only one. However, we didn't learn how the CPU resources are distributed from the experiment in the previous section. Therefore, in this section, we will confirm through experiments that the scheduler allows executable processes to use the CPU in time slice units.

We will use a program called sched.py for the experiment.

#!/usr/bin/python3

import sys
import time
import os
import plot_sched

def usage():
    print("""Usage: {} <number of processes>
        * After starting <number of processes> load processing processes on logical CPU0, which consume CPU resources for about 100 milliseconds simultaneously, wait for all processes to end.
        * Write out a graph showing the execution results to a file named "sched-<number of processes>.jpg".
        * The x-axis of the graph represents elapsed time [milliseconds] from the start of the load processing process, and the y-axis represents progress [%].""".format(progname), file=sys.stderr)
    sys.exit(1)

# Find the appropriate load for this experimentation.
# This estimation is expected to take several seconds.
# Please increment/decrement NLOOP_FOR_ESTIMATION if this process takes too long/short time.
NLOOP_FOR_ESTIMATION=100000000
nloop_per_msec = None
progname = sys.argv[0]

def estimate_loops_per_msec():
    before = time.perf_counter()
    for _ in range(NLOOP_FOR_ESTIMATION):
        pass
    after = time.perf_counter()
    return int(NLOOP_FOR_ESTIMATION/(after-before)/1000)

def child_fn(n):
    progress = 100*[None]
    for i in range(100):
        for _ in range(nloop_per_msec):
            pass
        progress[i] = time.perf_counter()
    f = open("{}.data".format(n),"w")
    for i in range(100):
        f.write("{}\t{}\n".format((progress[i]-start)*1000,i))
    f.close()
    exit(0)

if len(sys.argv) < 2:
    usage()

concurrency = int(sys.argv[1])

if concurrency < 1:
    print("<the number of processes> should be >= 1: {}".format(concurrency))
    usage()

# Force to run on logical CPU0
os.sched_setaffinity(0, {0})

nloop_per_msec = estimate_loops_per_msec()

start = time.perf_counter()

for i in range(concurrency):
    pid = os.fork()
    # os.fork() raises OSError on failure, so only the child branch is needed
    if pid == 0:
        child_fn(i)

for i in range(concurrency):
    os.wait()

plot_sched.plot_sched(concurrency)

This program continuously runs one or more load-processing processes that use CPU time and collects the following statistical information:

  • At a certain point, which process is running on the logical CPU
  • How much progress each one has made

By analyzing this data, we will check whether the explanation of the scheduler given at the beginning is correct. The specification of the experimental program sched.py is as follows.

Usage: sched.py <number of processes>
        * After starting <number of processes> load processing processes on logical CPU0, which consume CPU resources for about 100 milliseconds simultaneously, wait for all processes to end.
        * Write out a graph showing the execution results to a file named "sched-<number of processes>.jpg".
        * The x-axis of the graph represents elapsed time [milliseconds] from the start of the load processing process, and the y-axis represents progress [%].
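Both experiment programs pin themselves to logical CPU0 with os.sched_setaffinity. A minimal standalone sketch of that call (Linux only):

```python
import os

# Pin the calling process (pid 0 means "this process") to logical CPU 0,
# then read the affinity mask back to verify.
os.sched_setaffinity(0, {0})
print(os.sched_getaffinity(0))  # → {0}
```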

The file plot_sched.py is also used for graph drawing, so if you are running the sched program, please place plot_sched.py in the same directory.

#!/usr/bin/python3

import numpy as np
from PIL import Image
import matplotlib
import os

matplotlib.use('Agg')

import matplotlib.pyplot as plt

plt.rcParams['font.family'] = "sans-serif"
plt.rcParams['font.sans-serif'] = "TakaoPGothic"

def plot_sched(concurrency):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    for i in range(concurrency):
        x, y = np.loadtxt("{}.data".format(i), unpack=True)
        ax.scatter(x,y,s=1)
    ax.set_title("Visualize timeslice(concurrency={})".format(concurrency))
    ax.set_xlabel("elapsed time[ms]")
    ax.set_xlim(0)
    ax.set_ylabel("progress[%]")
    ax.set_ylim([0,100])
    legend = []
    for i in range(concurrency):
        legend.append("load"+str(i))
    ax.legend(legend)

    # Save the image as png temporarily and convert this to jpg to avoid a bug in matplotlib exists in Ubuntu 20.04.
    # https://bugs.launchpad.net/ubuntu/+source/matplotlib/+bug/1897283?comments=all
    pngfilename = "sched-{}.png".format(concurrency)
    jpgfilename = "sched-{}.jpg".format(concurrency)
    fig.savefig(pngfilename)
    Image.open(pngfilename).convert("RGB").save(jpgfilename)
    os.remove(pngfilename)

def plot_avg_tat(max_nproc):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    x, y, _ = np.loadtxt("cpuperf.data", unpack=True)
    ax.scatter(x,y,s=1)
    ax.set_xlim([0, max_nproc+1])
    ax.set_xlabel("concurrency")
    ax.set_ylim(0)
    ax.set_ylabel("average turnaround time[sec]")

    # Save the image as png temporarily and convert this to jpg to avoid a bug in matplotlib exists in Ubuntu 20.04.
    # https://bugs.launchpad.net/ubuntu/+source/matplotlib/+bug/1897283?comments=all
    pngfilename = "avg-tat.png"
    jpgfilename = "avg-tat.jpg"
    fig.savefig(pngfilename)
    Image.open(pngfilename).convert("RGB").save(jpgfilename)
    os.remove(pngfilename)

def plot_throughput(max_nproc):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    x, _, y = np.loadtxt("cpuperf.data", unpack=True)
    ax.scatter(x,y,s=1)
    ax.set_xlim([0, max_nproc+1])
    ax.set_xlabel("concurrency")
    ax.set_ylim(0)
    ax.set_ylabel("throughput[processes/sec]")

    # Save the image as png temporarily and convert this to jpg to avoid a bug in matplotlib exists in Ubuntu 20.04.
    # https://bugs.launchpad.net/ubuntu/+source/matplotlib/+bug/1897283?comments=all
    pngfilename = "throughput.png"
    jpgfilename = "throughput.jpg"
    fig.savefig(pngfilename)
    Image.open(pngfilename).convert("RGB").save(jpgfilename)
    os.remove(pngfilename)

We run this program with a concurrency of 1, 2, and 3, respectively:

for i in 1 2 3 ; do ./sched.py $i ; done

The results are shown in the following figures.

[Figure: sched.py result with concurrency=1 (sched-1.jpg)]

[Figure: sched.py result with concurrency=2 (sched-2.jpg)]

[Figure: sched.py result with concurrency=3 (sched-3.jpg)]

These graphs show that when multiple processes are running on a single logical CPU, each process is using the CPU alternately in time slices of a few milliseconds.

Column: How Time Slices Work

Looking closely at Figure XX, you can see that each process's time slice is shorter when the concurrency is 3 than when it is 2. In fact, Linux's scheduler is designed so that every runnable process gets CPU time once per period, called the latency target, which is given by the sysctl parameter kernel.sched_latency_ns (in nanoseconds).

In the author's environment, this parameter is set to the following value:

$ sysctl kernel.sched_latency_ns
kernel.sched_latency_ns = 24000000  # 24000000/1000000 = 24 milliseconds
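The same value can also be read programmatically. A hedged sketch, assuming the procfs path used by kernels that expose this sysctl (newer kernels moved it under debugfs, so the sketch handles its absence):

```python
from pathlib import Path

# kernel.sched_latency_ns is visible under /proc/sys on older kernels;
# on newer ones it may not exist there, so check before reading.
path = Path("/proc/sys/kernel/sched_latency_ns")
if path.exists():
    latency_ns = int(path.read_text())
    print(f"latency target: {latency_ns / 1_000_000:g} ms")
else:
    print("kernel.sched_latency_ns is not exposed on this kernel")
```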

The time slice for each process is kernel.sched_latency_ns / <number of processes> nanoseconds.
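The formula above can be sketched directly, assuming the 24 ms latency target from the author's environment:

```python
# Time slice per process = latency target / number of runnable processes.
sched_latency_ns = 24_000_000  # example value from the author's environment

for nproc in (1, 2, 3):
    timeslice_ms = sched_latency_ns / nproc / 1_000_000
    print(f"{nproc} process(es): {timeslice_ms:g} ms per time slice")
```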

The relationship between the latency target and the time slice when there are 1 to 3 executable processes on a logical CPU is shown in the following figure.

[Figure: relationship between the latency target and the time slice for 1 to 3 runnable processes]

Before Linux kernel 2.6.23, the scheduler used a fixed time slice (100 milliseconds). This posed a problem: as the number of processes increased, CPU time stopped rotating efficiently between them. To address this, the current scheduler adjusts the time slice according to the number of processes.

The calculation of the latency target and time slice values becomes slightly more complicated as the number of processes grows or on multicore CPUs, varying depending on factors such as:

  • The number of logical CPUs in the system
  • Whether the number of processes running/waiting on a logical CPU exceeds a certain value
  • The nice value, which represents the priority of the process

In this section, we will discuss the effect of the nice value. The nice value is a setting that defines the execution priority of a process within a range from "-20" to "19" (the default is "0"). -20 is the highest priority and 19 is the lowest. Any user can lower the priority, but only users with root privileges can raise it.

The nice value can be changed using the nice command, renice command, nice() system call, setpriority() system call, and so on. The scheduler gives more time slices to processes with a low nice value (i.e., a high priority).
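From Python, the nice() system call is exposed as os.nice(). A minimal sketch (it assumes the process starts at the default nice value of 0):

```python
import os

# os.nice(increment) adds increment to the current nice value and returns
# the new value; an increment of 0 just reads the current value.
print(os.nice(0))  # current nice value (0 by default)
os.nice(5)         # lower the priority by 5; no root privileges needed
print(os.nice(0))
```

Raising the priority back (a negative increment) would fail with a PermissionError unless the process has root privileges.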

Let's try running the sched_nice.py program with the following specification:

Usage: sched_nice.py <nice value>
        * After launching two load processing processes that consume about 100 milliseconds of CPU resources on logical CPU0, wait for both processes to finish.
        * The nice values of load processes 0 and 1 are set to 0 (default) and <nice value>, respectively.
        * Writes a graph showing the execution result to a file named "sched-2.jpg".
        * The x-axis of the graph represents the elapsed time from the start of the process [milliseconds], and the y-axis represents the progress [%].

The source code is here.

#!/usr/bin/python3

import sys
import time
import os
import plot_sched

def usage():
    print("""Usage: {} <nice value>
        * After launching two load processing processes that consume about 100 milliseconds of CPU resources on logical CPU0, wait for both processes to finish.
        * The nice values of load processes 0 and 1 are set to 0 (default) and <nice value>, respectively.
        * Writes a graph showing the execution result to a file named "sched-2.jpg".
        * The x-axis of the graph represents the elapsed time from the start of the process [milliseconds], and the y-axis represents the progress [%].""".format(progname), file=sys.stderr)
    sys.exit(1)

# Find the appropriate load for this experimentation.
# This estimation is expected to take several seconds.
# Please increment/decrement NLOOP_FOR_ESTIMATION if this process takes too long/short time.
NLOOP_FOR_ESTIMATION=100000000
nloop_per_msec = None
progname = sys.argv[0]

def estimate_loops_per_msec():
    before = time.perf_counter()
    for _ in range(NLOOP_FOR_ESTIMATION):
        pass
    after = time.perf_counter()
    return int(NLOOP_FOR_ESTIMATION/(after-before)/1000)

def child_fn(n):
    progress = 100*[None]
    for i in range(100):
        for _ in range(nloop_per_msec):
            pass
        progress[i] = time.perf_counter()
    f = open("{}.data".format(n),"w")
    for i in range(100):
        f.write("{}\t{}\n".format((progress[i]-start)*1000,i))
    f.close()
    exit(0)

if len(sys.argv) < 2:
    usage()

nice = int(sys.argv[1])
concurrency = 2

if concurrency < 1:
    print("<the number of processes> should be >= 1: {}".format(concurrency))
    usage()

# Force running on logical CPU0
os.sched_setaffinity(0, {0})

nloop_per_msec = estimate_loops_per_msec()

start = time.perf_counter()

for i in range(concurrency):
    pid = os.fork()
    # os.fork() raises OSError on failure, so only the child branch is needed
    if pid == 0:
        if i == concurrency - 1:
            os.nice(nice)
        child_fn(i)

for i in range(concurrency):
    os.wait()

plot_sched.plot_sched(concurrency)

Here, let's specify 5 for <nice value>.

$ ./sched_nice.py 5

The results are shown in the following graph.

[Figure: sched_nice.py result with nice values 0 and 5 (sched-2.jpg)]

As expected, you can see that load process 0 has more time slices than load process 1.
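The size of the gap follows from how CFS weights processes: each nice step scales a task's weight by roughly a factor of 1.25 (the kernel uses a precomputed table close to this factor). A rough sketch, not the kernel's exact table, of the expected CPU split between nice 0 and nice 5:

```python
# Approximate CFS weight: 1024 at nice 0, shrinking by ~1.25x per nice step.
def cfs_weight(nice: int) -> float:
    return 1024 / (1.25 ** nice)

w0, w5 = cfs_weight(0), cfs_weight(5)
print(f"nice 0 share: {w0 / (w0 + w5):.0%}")  # roughly 75%
```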

By the way, the "%nice" field in the output of sar shows the proportion of time spent executing, in user mode, processes whose priority has been lowered below the default nice value of 0 (%user covers processes at nice value 0). Let's run the inf-loop program we used in Chapter XX with a lowered priority (we'll set it to 5 here), and check the CPU usage with sar.

$ nice -n 5 taskset -c 0 ./inf-loop &
[1] 168376
$ sar -P 0 1 1
Linux 5.4.0-74-generic (coffee)         12/04/2021      _x86_64_        (8 CPU)

05:57:58        CPU     %user     %nice   %system   %iowait    %steal     %idle
05:57:59          0      0.00    100.00      0.00      0.00      0.00      0.00
Average:          0      0.00    100.00      0.00      0.00      0.00      0.00
$ kill 168376

You can see that it is %nice, not %user, that has reached 100.

Note that what we have discussed in this column is not defined by standards such as POSIX, so it may change as the kernel version changes. For example, the default value of kernel.sched_latency_ns has been changed many times in the past. Please be aware that even if you tune your system depending on the behavior mentioned here, it may not necessarily be effective in the future.

previous part

NOTE

This article is based on my book written in Japanese. Please contact me via satoru.takeuchi@gmail.com if you're interested in publishing this book's English version.
