<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Wenbo Zhang</title>
    <description>The latest articles on DEV Community by Wenbo Zhang (@ethercflow).</description>
    <link>https://dev.to/ethercflow</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F540018%2F1b579418-90aa-4f67-a5ba-3a864a3ec784.png</url>
      <title>DEV Community: Wenbo Zhang</title>
      <link>https://dev.to/ethercflow</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ethercflow"/>
    <language>en</language>
    <item>
      <title>Linux Kernel vs. Memory Fragmentation (Part II)</title>
      <dc:creator>Wenbo Zhang</dc:creator>
      <pubDate>Wed, 26 May 2021 05:52:28 +0000</pubDate>
      <link>https://dev.to/ethercflow/linux-kernel-vs-memory-fragmentation-part-ii-6mg</link>
      <guid>https://dev.to/ethercflow/linux-kernel-vs-memory-fragmentation-part-ii-6mg</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EQeNHjgZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/linux-memory-fragmentation-and-defragmentation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EQeNHjgZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/linux-memory-fragmentation-and-defragmentation.png" alt="Linux kernel memory fragmentation and defragmentation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-1"&gt;Linux Kernel vs. Memory Fragmentation (Part I)&lt;/a&gt;, I concluded that grouping pages by migration type only delays memory fragmentation; it does not fundamentally solve it. As fragmentation increases and the system runs short of contiguous physical memory, performance degrades.&lt;/p&gt;

&lt;p&gt;Therefore, to mitigate the performance degradation, the Linux kernel community introduced &lt;strong&gt;memory compaction&lt;/strong&gt; to the kernel.&lt;/p&gt;

&lt;p&gt;In this post, I'll explain the principle of memory compaction, how to view the fragmentation index, and how to quantify the latency overheads caused by memory compaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory compaction
&lt;/h2&gt;

&lt;p&gt;Before memory compaction, the kernel used lumpy reclaim for defragmentation. However, this feature was removed from v3.10 (currently the most widely used kernel version). If you'd like to learn more, you can read about lumpy reclaim in the articles I listed in &lt;a href="https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-1#a-brief-history-of-defragmentation"&gt;A brief history of defragmentation&lt;/a&gt;. For now, let's turn to memory compaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Algorithm introduction
&lt;/h3&gt;

&lt;p&gt;The article &lt;a href="https://lwn.net/Articles/368869/"&gt;Memory compaction&lt;/a&gt; on LWN.net explains the algorithmic idea of memory compaction in detail. You can take the following fragmented zone as a simple example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G_fB8HNC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/a-small-fragmented-memory-zone.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G_fB8HNC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/a-small-fragmented-memory-zone.png" alt="A small fragmented memory zone"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A small fragmented memory zone - LWN.net &lt;/p&gt;

&lt;p&gt;The white boxes are free pages, while those in red are allocated pages.&lt;/p&gt;

&lt;p&gt;Memory compaction for this zone breaks down into three major steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Scan this zone from left to right for red pages of the MIGRATE_MOVABLE migration type.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rjxHISN_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/linux-kernel-movable-pages.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rjxHISN_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/linux-kernel-movable-pages.png" alt="Search for movable pages"&gt;&lt;/a&gt;&lt;/p&gt;

 Search for movable pages &lt;/li&gt;
&lt;li&gt;
&lt;p&gt;At the same time, scan this zone from right to left for free pages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dPtWrv1H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/linux-kernel-free-pages.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dPtWrv1H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/linux-kernel-free-pages.png" alt="Search for free pages"&gt;&lt;/a&gt;&lt;/p&gt;

 Search for free pages &lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Shift movable pages at the bottom to free pages at the top, thus creating a contiguous chunk of free space.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g3Vv8O9o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/a-small-memory-zone-after-memory-compaction.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g3Vv8O9o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/a-small-memory-zone-after-memory-compaction.png" alt="The small memory zone after memory compaction"&gt;&lt;/a&gt;&lt;/p&gt;

 The memory zone after memory compaction &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This principle seems relatively simple, and the kernel also provides &lt;code&gt;/proc/sys/vm/compact_memory&lt;/code&gt; as the interface for manually triggering memory compaction.&lt;/p&gt;
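&lt;p&gt;For example, the following one-liner compacts memory system-wide. It is an illustrative invocation: it requires root, a Linux host, and a kernel built with &lt;code&gt;CONFIG_COMPACTION&lt;/code&gt;:&lt;/p&gt;

```shell
# Trigger compaction of all zones system-wide (requires root):
echo 1 > /proc/sys/vm/compact_memory
```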

&lt;p&gt;However, as mentioned in &lt;a href="https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-1"&gt;Part I&lt;/a&gt; and &lt;a href="https://lwn.net/Articles/591998/"&gt;Memory compaction issues&lt;/a&gt;, memory compaction is not very efficient in practice—at least for the most commonly-used kernel, v3.10—no matter whether it is triggered automatically or manually. Due to the high overhead it causes, it becomes a performance bottleneck instead.&lt;/p&gt;

&lt;p&gt;The open source community did not abandon this feature but continued to optimize it in subsequent versions. For example, the community &lt;a href="https://github.com/torvalds/linux/commit/698b1b3064"&gt;introduced kcompactd&lt;/a&gt; to the kernel in v4.6 and &lt;a href="https://lwn.net/Articles/686801/"&gt;made direct compaction more deterministic&lt;/a&gt; in v4.8.&lt;/p&gt;

&lt;h3&gt;
  
  
  When memory compaction is performed
&lt;/h3&gt;

&lt;p&gt;In kernel v3.10, memory compaction is performed under any of the following situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;kswapd&lt;/code&gt; kernel thread is called to balance zones after a failed high-order allocation.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;khugepaged&lt;/code&gt; kernel thread is called to collapse a huge page.&lt;/li&gt;
&lt;li&gt;Memory compaction is manually triggered via the &lt;code&gt;/proc&lt;/code&gt; interface.&lt;/li&gt;
&lt;li&gt;The system performs direct reclaim to meet higher-order memory requirements, including handling Transparent Huge Page (THP) page fault exceptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;a href="https://pingcap.com/blog/why-we-disable-linux-thp-feature-for-databases"&gt;Why We Disable Linux's THP Feature for Databases&lt;/a&gt;, I described how THP slows down performance and recommended disabling this feature. I will put it aside in this article and mainly focus on the memory allocation path.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QeLICvVl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/memory-allocation-in-the-slow-path.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QeLICvVl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/memory-allocation-in-the-slow-path.png" alt="Memory allocation in the slow path"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Memory allocation in the slow path &lt;/p&gt;

&lt;p&gt;When the kernel allocates pages, if there are no available pages in the free lists of the buddy system, the following occurs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The kernel processes this request in the slow path and tries to allocate pages using the low watermark as the threshold.&lt;/li&gt;
&lt;li&gt;If the memory allocation fails, which indicates that the memory may be slightly insufficient, the page allocator wakes up the &lt;code&gt;kswapd&lt;/code&gt; thread to asynchronously reclaim pages and attempts to allocate pages again, also using the low watermark as the threshold.&lt;/li&gt;
&lt;li&gt;If the allocation fails again, it means that the memory shortage is severe. In this case, the kernel runs asynchronous memory compaction first.&lt;/li&gt;
&lt;li&gt;If the allocation still does not succeed after the async memory compaction, the kernel directly reclaims memory.&lt;/li&gt;
&lt;li&gt;After the direct memory reclaim, if the kernel still hasn't reclaimed enough pages to meet the demand, it performs direct memory compaction. If it fails to reclaim even a single page, the OOM killer is invoked to free memory by killing processes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The above steps are only a simplified description of the actual workflow. In practice, the workflow is more complicated and varies with the requested memory order and the allocation flags.&lt;/p&gt;

&lt;p&gt;Direct memory reclaim is not only performed when memory is severely insufficient; in practice, it is also triggered by memory fragmentation. During a given period, the two situations may even occur simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to analyze memory compaction
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Quantify the performance latency
&lt;/h4&gt;

&lt;p&gt;As mentioned in the previous section, the kernel may perform memory reclaim or memory compaction when allocating memory. To make it easier to quantify the latency caused by direct memory reclaim and memory compaction for each participating thread, I committed two tools, &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/drsnoop_example.txt"&gt;drsnoop&lt;/a&gt; and &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/compactsnoop_example.txt"&gt;compactsnoop&lt;/a&gt;, to the &lt;a href="https://github.com/iovisor/bcc"&gt;BCC&lt;/a&gt; project.&lt;/p&gt;
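&lt;p&gt;Both tools ship with the BCC tool collection; a typical session looks like this. The invocations are illustrative only: they must run as root on a host with BCC installed, the install path varies by distribution, and each tool supports additional flags described in its linked documentation:&lt;/p&gt;

```shell
# Trace direct-memory-reclaim latency per process until Ctrl-C:
sudo /usr/share/bcc/tools/drsnoop

# Trace memory-compaction latency per process until Ctrl-C:
sudo /usr/share/bcc/tools/compactsnoop
```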

&lt;p&gt;Both tools are based on kernel events and come with detailed documentation, but there is one thing I want to note: to reduce the cost of introducing Berkeley Packet Filters (BPF), these two tools capture the latency of each corresponding event. Therefore, you may see from the output that each memory request corresponds to multiple latency results.&lt;/p&gt;

&lt;p&gt;The reason for the many-to-one relationship is that, for older kernels like v3.10, it is uncertain how many times the kernel will retry allocation during a slow-path memory allocation. This uncertainty also makes the OOM killer start working either too early or too late, leaving most tasks on the server hung for a long time.&lt;/p&gt;

&lt;p&gt;After the kernel merged the patch &lt;a href="https://github.com/torvalds/linux/commit/c73322d0"&gt;mm: fix 100% CPU kswapd busyloop on unreclaimable nodes&lt;/a&gt; in v4.12, the maximum number of direct memory reclaim attempts is limited to 16. Let's assume that the average latency of a direct memory reclaim is 10 ms. (Shrinking the active or inactive LRU lists is time consuming on today's servers with several hundred gigabytes of RAM, and there is additional delay if the server needs to wait for dirty pages to be written back.)&lt;/p&gt;

&lt;p&gt;If a thread asks the page allocator for pages and gets enough memory after only one direct memory reclaim, the latency of this allocation increases by 10 ms. If the kernel tries 16 times before reclaiming enough memory, the added latency is 160 ms instead of 10 ms, which may severely degrade performance.&lt;/p&gt;
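&lt;p&gt;The arithmetic is trivial but worth making explicit. Note the 10 ms figure is an assumed average, not a measured constant:&lt;/p&gt;

```shell
avg_reclaim_ms=10   # assumed average latency of one direct reclaim
max_retries=16      # cap on direct-reclaim attempts since v4.12
echo "best case:  $(( avg_reclaim_ms * 1 )) ms added"            # 10 ms
echo "worst case: $(( avg_reclaim_ms * max_retries )) ms added"  # 160 ms
```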

&lt;h4&gt;
  
  
  View the fragmentation index
&lt;/h4&gt;

&lt;p&gt;Let's come back to memory compaction. There are four main steps for the core logic of memory compaction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Determine whether a memory zone is suitable for memory compaction.&lt;/li&gt;
&lt;li&gt;Set the starting page frame number for scanning.&lt;/li&gt;
&lt;li&gt;Isolate pages of the MIGRATE_MOVABLE type.&lt;/li&gt;
&lt;li&gt;Migrate pages of the MIGRATE_MOVABLE type to the top of the zone.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the zone still needs compaction after one migration, the kernel loops through the above process three or four times until the compaction finishes. This operation consumes a lot of CPU resources, so monitoring often shows system CPU usage pegged at 100%.&lt;/p&gt;

&lt;p&gt;Well then, how does the kernel determine whether a zone is suitable for memory compaction?&lt;/p&gt;

&lt;p&gt;If you use the &lt;code&gt;/proc/sys/vm/compact_memory&lt;/code&gt; interface to forcibly require memory compaction for a zone, there is no need for the kernel to determine it.&lt;/p&gt;

&lt;p&gt;If memory compaction is automatically triggered, the kernel calculates the fragmentation index of the requested order to determine whether the zone has enough memory left for compaction. The closer the index is to 0, the more likely the allocation is to fail due to insufficient memory, which means memory reclaim is more suitable than memory compaction. The closer the index is to 1,000, the more likely the allocation is to fail due to excessive external fragmentation; in this situation, it is appropriate to do memory compaction, not memory reclaim.&lt;/p&gt;
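&lt;p&gt;The index can be sketched as follows. This mirrors the calculation in &lt;code&gt;__fragmentation_index()&lt;/code&gt; in &lt;code&gt;mm/vmstat.c&lt;/code&gt;; the function and argument names here are my own simplification:&lt;/p&gt;

```shell
# Sketch of the kernel's fragmentation-index calculation for one zone.
# Arguments: requested order, total free pages, total free blocks,
# and the number of free blocks already large enough for the request.
frag_index() {
  local order=$1 free_pages=$2 free_blocks_total=$3 free_blocks_suitable=$4
  local requested=$(( 2 ** order ))
  # No free blocks at all: allocation fails purely for lack of memory.
  if [ "$free_blocks_total" -eq 0 ]; then echo 0; return; fi
  # A suitably large free block exists: the request would not fail.
  if [ "$free_blocks_suitable" -gt 0 ]; then echo -1000; return; fi
  # 0 means "not enough memory"; 1000 means "enough memory, too fragmented".
  echo $(( 1000 - (1000 + free_pages * 1000 / requested) / free_blocks_total ))
}

# 16 free pages scattered as 16 order-0 blocks, order-4 request:
frag_index 4 16 16 0    # prints 875: fragmentation, not shortage, is the problem
```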

&lt;p&gt;Whether the kernel chooses to perform memory compaction or memory reclaim is determined by the external fragmentation threshold. You can view this threshold through the &lt;code&gt;/proc/sys/vm/extfrag_threshold&lt;/code&gt; interface.&lt;/p&gt;
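&lt;p&gt;For example (requires a Linux host; 500 is the usual default, i.e. 0.5 on the divided-by-1,000 scale):&lt;/p&gt;

```shell
# Read the external fragmentation threshold:
cat /proc/sys/vm/extfrag_threshold
```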

&lt;p&gt;You can view the fragmentation index directly by executing &lt;code&gt;cat /sys/kernel/debug/extfrag/extfrag_index&lt;/code&gt;. Note that the values in the following screenshot are the index divided by 1,000:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R_Hfd7eY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/fragment-index-command.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R_Hfd7eY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/fragment-index-command.png" alt="Linux  raw `/sys/kernel/debug/extfrag/extfrag_index` endraw  command"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros and cons
&lt;/h4&gt;

&lt;p&gt;Both the monitoring interfaces based on the &lt;code&gt;/proc&lt;/code&gt; file system and the tools based on kernel events (&lt;a href="https://github.com/iovisor/bcc/blob/master/tools/drsnoop_example.txt"&gt;drsnoop&lt;/a&gt; and &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/compactsnoop_example.txt"&gt;compactsnoop&lt;/a&gt;) can be used to analyze memory compaction, but with different pros and cons.&lt;/p&gt;

&lt;p&gt;The monitoring interfaces are simple to use, but they cannot quantify the latency, and their sampling period is long. The kernel-event-based tools solve these problems, but using them requires some understanding of how the related kernel subsystems work, and they impose certain requirements on the kernel version.&lt;/p&gt;

&lt;p&gt;Therefore, the monitoring interfaces and the kernel-events-based tools actually complement each other. Using them together can help you to analyze memory compaction thoroughly.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to mitigate memory fragmentation
&lt;/h3&gt;

&lt;p&gt;The kernel is designed to take care of slow backend devices. For example, it implements the second chance method and the refault distance based on the LRU algorithm and does not support limiting the percentage of &lt;code&gt;page cache&lt;/code&gt;. Some companies used to customize their own kernel to limit the &lt;code&gt;page cache&lt;/code&gt; and tried to submit it to the upstream kernel community, but the community did not accept it. I think it may be because this feature causes problems such as working set refaults.&lt;/p&gt;

&lt;p&gt;Therefore, to reduce the frequency of direct memory reclaim and mitigate fragmentation issues, it is a good choice to increase &lt;code&gt;vm.min_free_kbytes&lt;/code&gt; (up to 5% of the total memory). This indirectly limits the percentage of &lt;code&gt;page cache&lt;/code&gt; in scenarios with heavy I/O on machines with more than 100 GB of memory.&lt;/p&gt;

&lt;p&gt;Although setting &lt;code&gt;vm.min_free_kbytes&lt;/code&gt; to a bigger value wastes some memory, the waste is negligible. For example, if a server has 256 GB of memory and you set &lt;code&gt;vm.min_free_kbytes&lt;/code&gt; to the equivalent of 4 GB, it only takes about 1.5% of the total memory space.&lt;/p&gt;
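&lt;p&gt;A quick sanity check of those numbers (&lt;code&gt;vm.min_free_kbytes&lt;/code&gt; is expressed in kB; the &lt;code&gt;sysctl&lt;/code&gt; line at the end is illustrative and requires root):&lt;/p&gt;

```shell
min_free_kb=$(( 4 * 1024 * 1024 ))    # 4 GB in kB: 4194304
total_kb=$(( 256 * 1024 * 1024 ))     # 256 GB in kB
# Share of total memory, in hundredths of a percent: prints 156, i.e. ~1.56%
echo $(( min_free_kb * 10000 / total_kb ))
# To apply: sysctl -w vm.min_free_kbytes=$min_free_kb
```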

&lt;p&gt;The community apparently noticed the waste of memory as well, so the kernel merged the patch &lt;a href="http://lkml.iu.edu/hypermail/linux/kernel/1602.3/02009.html"&gt;mm: scale kswapd watermarks in proportion to memory&lt;/a&gt; in v4.6 to optimize it.&lt;/p&gt;

&lt;p&gt;Another solution is to drop the page cache at the right time, but this may cause more jitter in application performance.&lt;/p&gt;
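&lt;p&gt;If you do go this route, one common pattern is the following (requires root; the value written selects what to drop):&lt;/p&gt;

```shell
sync                                # write dirty pages back first
echo 1 > /proc/sys/vm/drop_caches   # 1 = page cache, 2 = dentries+inodes, 3 = both
```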

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-1"&gt;Part I&lt;/a&gt; of this post series, I briefly explained why the external fragmentation affects performance and introduced the efforts the community has made over the years in defragmentation. Here in Part II, I've focused on the defragmentation principles in the kernel v3.10 and how to observe memory fragmentation quantitatively and qualitatively.&lt;/p&gt;

&lt;p&gt;I hope this post series will be helpful for you! If you have any other thoughts about Linux memory management, welcome to join the &lt;a href="https://slack.tidb.io/invite?team=tidb-community&amp;amp;channel=everyone&amp;amp;ref=pingcap-blog"&gt;TiDB Community Slack&lt;/a&gt; workspace to share and discuss with us.&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-2"&gt;Linux Kernel vs. Memory Fragmentation (Part II)&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Linux Kernel vs. Memory Fragmentation (Part I)</title>
      <dc:creator>Wenbo Zhang</dc:creator>
      <pubDate>Thu, 01 Apr 2021 06:31:13 +0000</pubDate>
      <link>https://dev.to/ethercflow/linux-kernel-vs-memory-fragmentation-part-i-4122</link>
      <guid>https://dev.to/ethercflow/linux-kernel-vs-memory-fragmentation-part-i-4122</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Flinux-memory-fragmentation-and-defragmentation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Flinux-memory-fragmentation-and-defragmentation.png" alt="Linux kernel memory fragmentation and defragmentation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(External) memory fragmentation is a long-standing Linux kernel programming issue. As the system runs, it assigns various tasks to memory pages. Over time, memory gets fragmented, and eventually, a busy system that is up for a long time may have only a few contiguous physical pages.&lt;/p&gt;

&lt;p&gt;Because the Linux kernel supports virtual memory management, physical memory fragmentation is often not an issue. With page tables, unless large pages are used, physically scattered memory is still contiguous in the virtual address space.&lt;/p&gt;

&lt;p&gt;However, it becomes very difficult to allocate contiguous physical memory from the kernel linear mapping area. For example, it is challenging to allocate structure objects through the block allocator—a common and frequent operation in the kernel mode—or operate on a Direct Memory Access (DMA) buffer that does not support the scatter and gather modes. Such operations might cause frequent direct memory reclamation or compaction, resulting in large fluctuations in system performance, or allocation failure. In slow memory allocation paths, different operations are performed according to the page allocation flag.&lt;/p&gt;

&lt;p&gt;If the kernel programming no longer relies on the high-order physical memory allocation in the linear address space, the memory fragmentation issue will be solved. However, for a huge project like the Linux kernel, it isn't practical to make such changes.&lt;/p&gt;

&lt;p&gt;Since Linux 2.x, the open source community has tried several methods to alleviate the memory fragmentation issue, including many effective, but unusual patches. Some merged patches have been controversial, such as the memory compaction mechanism. At the &lt;a href="https://lwn.net/Articles/591998/" rel="noopener noreferrer"&gt;LSFMM 2014&lt;/a&gt; conference, many developers complained that memory compaction was not very efficient and that bugs were not easy to reproduce. But the community did not abandon the feature and continued to optimize it in subsequent versions.&lt;/p&gt;

&lt;p&gt;Mel Gorman is the most persistent contributor in this field. He has submitted two sets of important patches. The first set was merged in Linux 2.6.24 and iterated over 28 versions before the community accepted it. The second set was merged in Linux 5.0 and successfully reduced memory fragmentation events by 94% on one- or two-socket machines.&lt;/p&gt;

&lt;p&gt;In this post, I'll introduce some common extensions to the &lt;a href="https://en.wikipedia.org/wiki/Buddy_memory_allocation" rel="noopener noreferrer"&gt;buddy allocator&lt;/a&gt; that help prevent memory fragmentation in the Linux 3.10 kernel, the principle of memory compaction, how to view the fragmentation index, and how to quantify the latency overheads caused by memory compaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  A brief history of defragmentation
&lt;/h2&gt;

&lt;p&gt;Before I start, I want to recommend some good reads. The following articles show you all the efforts of improving high-level memory allocation during Linux kernel development.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;Publish date
   &lt;/td&gt;
   &lt;td&gt;Articles on LWN.net
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2004-09-08
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/101230/" rel="noopener noreferrer"&gt;Kswapd and high-order allocations&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2004-05-10
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/105021/" rel="noopener noreferrer"&gt;Active memory defragmentation&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2005-02-01
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/121618/" rel="noopener noreferrer"&gt;Yet another approach to memory fragmentation&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2005-11-02
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/158211/" rel="noopener noreferrer"&gt;Fragmentation avoidance&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2005-11-08
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/159110/" rel="noopener noreferrer"&gt;More on fragmentation avoidance&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2006-11-28
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/211505/" rel="noopener noreferrer"&gt;Avoiding - and fixing - memory fragmentation&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2010-01-06
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/368869/" rel="noopener noreferrer"&gt;Memory compaction&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2014-03-26
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/591998/" rel="noopener noreferrer"&gt;Memory compaction issues&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2015-07-14
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/650917/" rel="noopener noreferrer"&gt;Making kernel pages movable&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2016-04-23
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/684611/" rel="noopener noreferrer"&gt;CMA and compaction&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2016-05-10
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/686801/" rel="noopener noreferrer"&gt;Make direct compaction more deterministic&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2017-03-21
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/717656/" rel="noopener noreferrer"&gt;Proactive compaction&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2018-10-31
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/770235/" rel="noopener noreferrer"&gt;Fragmentation avoidance improvements&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2020-04-21
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/817905/" rel="noopener noreferrer"&gt;Proactive compaction for the kernel&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now, let's get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linux buddy memory allocator
&lt;/h2&gt;

&lt;p&gt;Linux uses the &lt;a href="https://en.wikipedia.org/wiki/Buddy_memory_allocation" rel="noopener noreferrer"&gt;buddy algorithm&lt;/a&gt; as a page allocator, which is simple and efficient. Linux has made some extensions to the classic algorithm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partitions' buddy allocator&lt;/li&gt;
&lt;li&gt;Per-CPU pageset&lt;/li&gt;
&lt;li&gt;Group by migration types&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Linux kernel uses node, zone, and page to describe physical memory. The partitions' buddy allocator focuses on a certain zone on a certain node.&lt;/p&gt;

&lt;p&gt;Before version 4.8, the Linux kernel implemented its page reclaim strategy per zone, because the early design mainly targeted 32-bit processors, which had a lot of high memory. However, pages in different zones on the same node aged at inconsistent rates, which caused many problems.&lt;/p&gt;

&lt;p&gt;Over a long period, the community added a lot of tricky patches, but the problem remained. With more 64-bit processors and large-memory machines in use in recent years, Mel Gorman migrated the page reclaim strategy from the zone level to the node level, which solved this problem. If you use Berkeley Packet Filter (BPF) tools to observe reclaim operations, you need to know this.&lt;/p&gt;

&lt;p&gt;The per-CPU pageset optimizes single page allocation, which reduces lock contention between processors. It has nothing to do with defragmentation.&lt;/p&gt;

&lt;p&gt;Grouping by migration types is the defragmentation method I'll introduce in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Group by migration types
&lt;/h2&gt;

&lt;p&gt;First, you need to understand the memory address space layout. Each processor architecture has a definition. For example, the definition of x86_64 is in &lt;a href="https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt" rel="noopener noreferrer"&gt;mm.txt&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Because the virtual address and physical address are not linearly mapped, accessing the virtual address space through the page table (such as the heap memory requirement of the user space) does not require contiguous physical memory. Take the &lt;a href="https://en.wikipedia.org/wiki/Intel_5-level_paging" rel="noopener noreferrer"&gt;Intel 5-level page table&lt;/a&gt; in the following figure as an example. The virtual address is divided from low to high:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low: Page offset&lt;/li&gt;
&lt;li&gt;Level 1: Direct page table index&lt;/li&gt;
&lt;li&gt;Level 2: Page middle directory index&lt;/li&gt;
&lt;li&gt;Level 3: Page upper directory index&lt;/li&gt;
&lt;li&gt;Level 4: Page 4-level directory index&lt;/li&gt;
&lt;li&gt;Level 5: Page global directory index&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fintel-5-level-paging.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fintel-5-level-paging.png" alt="Intel 5-level paging - Virtual address"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Intel 5-level paging - Wikipedia &lt;/p&gt;

&lt;p&gt;The page frame number of the physical memory is stored in the direct page table entry, and you can find it through the direct page table index. &lt;strong&gt;The physical address is the combination of the found page frame number and the page offset.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Suppose you want to change the corresponding physical page in a direct page table entry. You only need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Allocate a new page.&lt;/li&gt;
&lt;li&gt;Copy the data of the old page to the new one.&lt;/li&gt;
&lt;li&gt;Modify the value of the direct page table entry to the new page frame number.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These operations do not change the original virtual address, and you can migrate such pages at will.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For the linear mapping area, the virtual address equals the physical address plus the constant.&lt;/strong&gt; Modifying the physical address changes the virtual address, and accessing the original virtual address causes a bug. Therefore, it is not recommended to migrate these pages.&lt;/p&gt;

&lt;p&gt;When the physical pages accessed through the page table and the pages accessed through linear mapping are mixed and managed together, memory fragmentation is prone to occur. Therefore, &lt;strong&gt;the kernel defines several migration types based on the mobility of the pages and groups the pages by the migration types for defragmentation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Among the defined migration types, the three most frequently used are: &lt;strong&gt;MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, and MIGRATE_RECLAIMABLE&lt;/strong&gt;. Other migration types have special purposes, which I won't describe here.&lt;/p&gt;

&lt;p&gt;You can view the distribution of each migration type at each stage through &lt;code&gt;/proc/pagetypeinfo&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fpagetypeinfo-command.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fpagetypeinfo-command.png" alt="Linux  raw `/proc/pagetypeinfo` endraw  command"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When you allocate a page, the page allocation flag you use determines the migration type from which the page is allocated.&lt;/strong&gt; For example, you can use &lt;code&gt;__GFP_MOVABLE&lt;/code&gt; for user space memory, and &lt;code&gt;__GFP_RECLAIMABLE&lt;/code&gt; for file pages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When pages of a certain migration type are used up, the kernel steals physical pages from other migration types.&lt;/strong&gt; To avoid fragmentation, the page stealing starts from the largest page block. The page block size is determined by &lt;code&gt;pageblock_order&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fallback priorities of the above three migration types, from high to low, are:&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MIGRATE_UNMOVABLE:    MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE
MIGRATE_RECLAIMABLE:  MIGRATE_UNMOVABLE, MIGRATE_MOVABLE
MIGRATE_MOVABLE:      MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The kernel introduces grouping by migration types for defragmentation. But frequent page stealing indicates that there are external memory fragmentation events, and they might cause trouble in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analyze external memory fragmentation events
&lt;/h2&gt;

&lt;p&gt;My previous article &lt;a href="https://en.pingcap.com/blog/why-we-disable-linux-thp-feature-for-databases" rel="noopener noreferrer"&gt;Why We Disable Linux's THP Feature for Databases&lt;/a&gt; mentioned that you can use ftrace events provided by the kernel to analyze external memory fragmentation events. The procedure is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Enable the ftrace events:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;1&amp;gt; /sys/kernel/debug/tracing/events/kmem/mm_page_alloc_extfrag/enable
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Start collecting the ftrace events:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/kernel/debug/tracing/trace_pipe&amp;gt; ~/extfrag.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Press Ctrl-C to stop collecting. An event contains many fields:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Flinux-ftrace-events-fields.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Flinux-ftrace-events-fields.png" alt="Linux ftrace events' fields"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To analyze the number of external memory fragmentation events, focus on &lt;strong&gt;the events with &lt;code&gt;fallback_order &amp;lt; pageblock_order&lt;/code&gt;&lt;/strong&gt;. In the x86_64 environment, &lt;code&gt;pageblock_order&lt;/code&gt; is 9.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Clean up the events:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;0&amp;gt; /sys/kernel/debug/tracing/events/kmem/mm_page_alloc_extfrag/enable
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can see that grouping by migration types only delays memory fragmentation, but does not fundamentally solve it.&lt;/p&gt;

&lt;p&gt;As memory fragmentation increases and the system runs short of contiguous physical memory, performance degrades. So it's not enough to rely on this feature alone.&lt;/p&gt;

&lt;p&gt;In my next article, I'll introduce more methods that the kernel uses to regulate memory fragmentation.&lt;/p&gt;

&lt;p&gt;To be continued…&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-1/" rel="noopener noreferrer"&gt;Linux Kernel vs. Memory Fragmentation (Part I)&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Trace Linux System Calls in Production with Minimal Impact on Performance</title>
      <dc:creator>Wenbo Zhang</dc:creator>
      <pubDate>Thu, 21 Jan 2021 15:44:56 +0000</pubDate>
      <link>https://dev.to/ethercflow/how-to-trace-linux-system-calls-in-production-with-minimal-impact-on-performance-ndh</link>
      <guid>https://dev.to/ethercflow/how-to-trace-linux-system-calls-in-production-with-minimal-impact-on-performance-ndh</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fhow-to-trace-linux-syscalls.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fhow-to-trace-linux-syscalls.jpg" alt="How to trace Linux System Calls in Production with Minimal Impact on Performance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you need to dynamically trace Linux process system calls, you might first consider strace. strace is simple to use and works well for issues such as "Why can't the software run on this machine?" However, if you're running a trace in a production environment, strace is NOT a good choice. It introduces a substantial amount of overhead. According to &lt;a href="http://vger.kernel.org/~acme/perf/linuxdev-br-2018-perf-trace-eBPF/#/4/2" rel="noopener noreferrer"&gt;a performance test&lt;/a&gt; conducted by Arnaldo Carvalho de Melo, a senior software engineer at Red Hat, &lt;strong&gt;the process traced using strace ran 173 times slower, which is disastrous for a production environment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So are there any tools that excel at tracing system calls in a production environment? The answer is YES. This blog post introduces perf and traceloop, two commonly used command-line tools, to help you trace system calls in a production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  perf, a performance profiler for Linux
&lt;/h2&gt;

&lt;p&gt;perf is a powerful Linux profiling tool, refined and upgraded by Linux kernel developers. In addition to common features such as analyzing Performance Monitoring Unit (PMU) hardware events and kernel events, perf has the following subcomponents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sched: Analyzes scheduler actions and latencies.&lt;/li&gt;
&lt;li&gt;timechart: Visualizes system behaviors based on the workload.&lt;/li&gt;
&lt;li&gt;c2c: Detects the potential for false sharing. Red Hat once tested the c2c prototype on a number of Linux applications and found many cases of false sharing and hot cache lines.&lt;/li&gt;
&lt;li&gt;trace: Traces system calls with acceptable overhead. A workload generated by the &lt;code&gt;dd&lt;/code&gt; command runs only &lt;strong&gt;1.36&lt;/strong&gt; times slower when traced.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's look at some common uses of perf.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;To see which commands made the most system calls:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;perf top &lt;span class="nt"&gt;-F&lt;/span&gt; 49 &lt;span class="nt"&gt;-e&lt;/span&gt; raw_syscalls:sys_enter &lt;span class="nt"&gt;--sort&lt;/span&gt; &lt;span class="nb"&gt;comm&lt;/span&gt;,dso &lt;span class="nt"&gt;--show-nr-samples&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsystem-call-counts.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsystem-call-counts.jpg" alt="System call counts"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;From the output, you can see that the &lt;code&gt;kube-apiserver&lt;/code&gt; command had the most system calls during sampling.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To see system calls that have latencies longer than a specific duration. In the following example, this duration is 200 milliseconds:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;perf trace &lt;span class="nt"&gt;--duration&lt;/span&gt; 200
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsystem-calls-longer-than-200-ms.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsystem-calls-longer-than-200-ms.jpg" alt="System calls longer than 200 ms"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;From the output, you can see the process names, process IDs (PIDs), the specific system calls that exceed 200 ms, and the returned values.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To see the processes that had system calls within a period of time and a summary of their overhead:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;perf trace &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;$PID&lt;/span&gt;  &lt;span class="nt"&gt;-s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsystem-call-overheads-by-process.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsystem-call-overheads-by-process.jpg" alt="System call overheads by process"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;From the output, you can see the number of times each system call was made, the number of errors, the total latency, the average latency, and so on.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To analyze the stack information of calls that have a high latency:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;perf trace record &lt;span class="nt"&gt;--call-graph&lt;/span&gt; dwarf &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;$PID&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;sleep &lt;/span&gt;10
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fstack-information-of-system-calls-with-high-latency.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fstack-information-of-system-calls-with-high-latency.jpg" alt="Stack information of system calls with high latency"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To trace a group of tasks. For example, if two BPF tools are running in the background and you want to see their system call information, you can add them to a &lt;code&gt;perf_event&lt;/code&gt; cgroup and then execute &lt;code&gt;perf trace&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; /sys/fs/cgroup/perf_event/bpftools/
&lt;span class="nb"&gt;echo &lt;/span&gt;22542 &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /sys/fs/cgroup/perf_event/bpftools/tasks
&lt;span class="nb"&gt;echo &lt;/span&gt;20514 &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /sys/fs/cgroup/perf_event/bpftools/tasks
perf trace &lt;span class="nt"&gt;-G&lt;/span&gt; bpftools &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;sleep &lt;/span&gt;10
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftrace-a-group-of-tasks.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftrace-a-group-of-tasks.jpg" alt="Trace a group of tasks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are some of the most common uses of perf. If you'd like to know more (especially about perf-trace), see the &lt;a href="https://man7.org/linux/man-pages/man1/perf-trace.1.html" rel="noopener noreferrer"&gt;Linux manual page&lt;/a&gt;. From the manual pages, you will learn that perf-trace can filter tasks based on PIDs or thread IDs (TIDs), but that it has no convenient support for containers or Kubernetes (K8s) environments. Don't worry. Next, we'll discuss a tool that can easily trace system calls in containers and K8s environments that use cgroup v2.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traceloop, a performance profiler for cgroup v2 and K8s
&lt;/h2&gt;

&lt;p&gt;Traceloop provides better support for tracing Linux system calls in containers or K8s environments that use cgroup v2. You might be unfamiliar with traceloop but know the BPF Compiler Collection (BCC) pretty well. (Its front end is implemented in Python and C++.) In the IO Visor Project, BCC's parent project, there is another project named gobpf that provides Golang bindings for the BCC framework. traceloop is built on top of gobpf for container and K8s environments. The following illustration shows the traceloop architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftraceloop-architecture.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftraceloop-architecture.jpg" alt="traceloop architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;traceloop architecture &lt;/p&gt;

&lt;p&gt;We can further simplify this illustration into the following key procedures. Note that these procedures are implementation details, not operations to perform:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;bpf helper&lt;/code&gt; gets the cgroup ID. Tasks are filtered based on the cgroup ID rather than on the PID and TID.&lt;/li&gt;
&lt;li&gt;Each cgroup ID corresponds to a &lt;a href="https://ebpf.io/what-is-ebpf/#tail--function-calls" rel="noopener noreferrer"&gt;bpf tail call&lt;/a&gt; that can call and execute another eBPF program and replace the execution context. Syscall events are written through a bpf tail call to a perf ring buffer with the same cgroup ID.&lt;/li&gt;
&lt;li&gt;The user space reads the perf ring buffer based on this cgroup ID.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Currently, you can get the cgroup ID only through the bpf helper &lt;code&gt;bpf_get_current_cgroup_id&lt;/code&gt;, and this ID is available only in cgroup v2. Therefore, before you use traceloop, make sure that &lt;a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#basic-operations" rel="noopener noreferrer"&gt;cgroup v2 is enabled&lt;/a&gt; in your environment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the following demo (on the CentOS 8 4.18 kernel), traceloop dumps the traced system call information when it exits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; ./traceloop cgroups &lt;span class="nt"&gt;--dump-on-exit&lt;/span&gt; /sys/fs/cgroup/system.slice/sshd.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftraceloop-tracing-system-calls.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftraceloop-tracing-system-calls.jpg" alt="traceloop tracing system calls"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;traceloop tracing system calls &lt;/p&gt;

&lt;p&gt;As the results show, the traceloop output is similar to that of strace or perf-trace except for the cgroup-based task filtering. Note that CentOS 8 mounts cgroup v2 directly on the &lt;code&gt;/sys/fs/cgroup&lt;/code&gt; path instead of on &lt;code&gt;/sys/fs/cgroup/unified&lt;/code&gt; as Ubuntu does. Therefore, before you use traceloop, you should run &lt;code&gt;mount -t cgroup2&lt;/code&gt; to determine the mount information.&lt;/p&gt;

&lt;p&gt;The team behind traceloop has integrated it with the Inspektor Gadget project, so you can run traceloop on the K8s platform using kubectl. See the demos in &lt;a href="https://github.com/kinvolk/inspektor-gadget#how-to-use" rel="noopener noreferrer"&gt;Inspektor Gadget - How to use&lt;/a&gt; and, if you like, try it on your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark with system calls traced
&lt;/h2&gt;

&lt;p&gt;We conducted a sysbench test in which system calls were either traced using multiple tracers (traceloop, strace, and perf-trace) or not traced. The benchmark results are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsysbench-results-with-system-calls-traced-and-untraced.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsysbench-results-with-system-calls-traced-and-untraced.jpg" alt="Sysbench results with system calls traced and untraced"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sysbench results with system calls traced and untraced &lt;/p&gt;

&lt;p&gt;As the benchmark shows, strace caused the biggest decrease in application performance. perf-trace caused a smaller decrease, and traceloop caused the smallest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary of Linux profilers
&lt;/h2&gt;

&lt;p&gt;For issues such as "Why can't the software run on this machine," strace is still a powerful system call tracer in Linux. But to trace the latency of system calls, the BPF-based perf-trace is a better option. In containers or K8s environments that use cgroup v2, traceloop is the easiest to use.&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://en.pingcap.com/blog/how-to-trace-linux-system-calls-in-production-with-minimal-impact-on-performance" rel="noopener noreferrer"&gt;How to Trace Linux System Calls in Production with Minimal Impact on Performance&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Tips and Tricks for Writing Linux BPF Applications with libbpf</title>
      <dc:creator>Wenbo Zhang</dc:creator>
      <pubDate>Tue, 12 Jan 2021 15:04:08 +0000</pubDate>
      <link>https://dev.to/ethercflow/tips-and-tricks-for-writing-linux-bpf-applications-with-libbpf-24p1</link>
      <guid>https://dev.to/ethercflow/tips-and-tricks-for-writing-linux-bpf-applications-with-libbpf-24p1</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Flinux-bpf-performance-analysis-tools.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Flinux-bpf-performance-analysis-tools.jpg" alt="Linux BPF performance analysis tools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the beginning of 2020, when I was using the BCC tools to analyze our database performance bottlenecks and pulled the latest code from GitHub, I accidentally discovered an additional &lt;a href="https://github.com/iovisor/bcc/tree/master/libbpf-tools" rel="noopener noreferrer"&gt;libbpf-tools&lt;/a&gt; directory in the BCC project. I had read an article on &lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html" rel="noopener noreferrer"&gt;BPF portability&lt;/a&gt; and another on &lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/20/bcc-to-libbpf-howto-guide.html" rel="noopener noreferrer"&gt;BCC to libbpf conversion&lt;/a&gt;, and I used what I learned to convert my previously submitted bcc-tools to libbpf-tools. I ended up converting nearly 20 tools. (See &lt;a href="https://pingcap.com/blog/why-we-switched-from-bcc-tools-to-libbpf-tools-for-bpf-performance-analysis" rel="noopener noreferrer"&gt;Why We Switched from bcc-tools to libbpf-tools for BPF Performance Analysis&lt;/a&gt;.) &lt;/p&gt;

&lt;p&gt;During this process, I was fortunate to get a lot of help from &lt;a href="https://github.com/anakryiko" rel="noopener noreferrer"&gt;Andrii Nakryiko&lt;/a&gt; (the libbpf + BPF CO-RE project's leader). It was fun and I learned a lot. In this post, I'll share my experience about writing Berkeley Packet Filter (BPF) applications with libbpf. I hope this article is helpful to people who are interested in libbpf and inspires them to further develop and improve BPF applications with libbpf. &lt;/p&gt;

&lt;p&gt;Before you read further, however, consider reading &lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html" rel="noopener noreferrer"&gt;these posts for important background information:&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html" rel="noopener noreferrer"&gt;BPF Portability and CO-RE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/20/bcc-to-libbpf-howto-guide.html" rel="noopener noreferrer"&gt;HOWTO: BCC to libbpf conversion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nakryiko.com/posts/libbpf-bootstrap/" rel="noopener noreferrer"&gt;Building BPF applications with libbpf-boostrap&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article assumes that you've already read these posts, so there won't be any systematic descriptions. Instead, I'll offer you some tips for certain parts of the program.&lt;/p&gt;

&lt;h2&gt;
  
  
  Program skeleton
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Combining the open and load phases
&lt;/h3&gt;

&lt;p&gt;If your BPF code doesn't need any runtime adjustments (for example, adjusting the map size or setting an extra configuration), you can call &lt;code&gt;&amp;lt;name&amp;gt;__open_and_load()&lt;/code&gt; to combine the two phases into one. This makes our code look more compact. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readahead_bpf__open_and_load&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"failed to open and/or load BPF object&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readahead_bpf__attach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/readahead.c#L75" rel="noopener noreferrer"&gt;readahead.c&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Selective attach
&lt;/h3&gt;

&lt;p&gt;By default, &lt;code&gt;&amp;lt;name&amp;gt;__attach()&lt;/code&gt; attaches all auto-attachable BPF programs. However, sometimes you might want to selectively attach the corresponding BPF program according to the command line parameters. In this case, you can call &lt;code&gt;bpf_program__attach()&lt;/code&gt; instead. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;biolatency_bpf__load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queued&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_rq_insert&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
                &lt;span class="n"&gt;bpf_program__attach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;progs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_rq_insert&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;libbpf_get_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_rq_insert&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_rq_issue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="n"&gt;bpf_program__attach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;progs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_rq_issue&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;libbpf_get_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_rq_issue&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/biolatency.c#L264" rel="noopener noreferrer"&gt;biolatency.c&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Custom load and attach
&lt;/h3&gt;

&lt;p&gt;Skeleton is suitable for almost all scenarios, but there is a special case: perf events. In this case, instead of using links from &lt;code&gt;struct &amp;lt;name&amp;gt;__bpf&lt;/code&gt;, you need to define an array: &lt;code&gt;struct bpf_link *links[]&lt;/code&gt;. The reason is that &lt;code&gt;perf_event&lt;/code&gt; needs to be opened separately on each CPU. &lt;/p&gt;

&lt;p&gt;After this, open and attach &lt;code&gt;perf_event&lt;/code&gt; by yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;open_and_attach_perf_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;bpf_program&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;prog&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;bpf_link&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;perf_event_attr&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PERF_TYPE_SOFTWARE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_period&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PERF_COUNT_SW_CPU_CLOCK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;nr_cpus&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;fd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__NR_perf_event_open&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"failed to init perf sampling: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;strerror&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errno&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
                        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_program__attach_perf_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prog&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;libbpf_get_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"failed to attach perf event on cpu: "&lt;/span&gt;
                                &lt;span class="s"&gt;"%d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                        &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                        &lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; 

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, during the tear-down phase, remember to destroy each link in the &lt;code&gt;links&lt;/code&gt; array and then free &lt;code&gt;links&lt;/code&gt; itself.&lt;/p&gt;
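
&lt;p&gt;The tear-down can be sketched like this (a minimal sketch, assuming the &lt;code&gt;links&lt;/code&gt; array and &lt;code&gt;nr_cpus&lt;/code&gt; from the setup code above; it needs libbpf to compile, and &lt;code&gt;bpf_link__destroy()&lt;/code&gt; tolerates &lt;code&gt;NULL&lt;/code&gt; entries that never attached):&lt;/p&gt;

```c
/* Tear-down sketch: destroy every per-CPU link first, then free the
 * array itself. Assumes `links` and `nr_cpus` come from the setup code
 * shown above; bpf_link__destroy() accepts NULL, so links that failed
 * to attach are safe to pass. */
#include <stdlib.h>
#include <bpf/libbpf.h>

static void free_links(struct bpf_link **links, int nr_cpus)
{
        int i;

        for (i = 0; i < nr_cpus; i++)
                bpf_link__destroy(links[i]);
        free(links);
}
```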

&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/runqlen.c" rel="noopener noreferrer"&gt;runqlen.c&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multiple handlers for the same event
&lt;/h3&gt;

&lt;p&gt;Starting in &lt;a href="https://github.com/libbpf/libbpf/releases/tag/v0.2" rel="noopener noreferrer"&gt;v0.2&lt;/a&gt;, libbpf supports multiple entry-point BPF programs within the same executable and linkable format (ELF) section. Therefore, you can attach multiple BPF programs to the same event (such as tracepoints or kprobes) without worrying about ELF section name clashes. For details, see &lt;a href="https://patchwork.ozlabs.org/project/netdev/cover/20200903203542.15944-1-andriin@fb.com/" rel="noopener noreferrer"&gt;Add libbpf full support for BPF-to-BPF calls&lt;/a&gt;. Now, you can naturally define multiple handlers for an event like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tp_btf/irq_handler_entry"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_PROG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;irq_handler_entry1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;irq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;irqaction&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tp_btf/irq_handler_entry"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_PROG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;irq_handler_entry2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/ethercflow/libbpf-bootstrap/blob/master/src/hardirqs.bpf.c" rel="noopener noreferrer"&gt;hardirqs.bpf.c&lt;/a&gt; (built with &lt;a href="https://github.com/libbpf/libbpf-bootstrap" rel="noopener noreferrer"&gt;libbpf-bootstrap&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;If your libbpf version is earlier than v0.2, to define multiple handlers for an event, you have to use multiple program types, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tracepoint/irq/irq_handler_entry"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;handle__irq_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;trace_event_raw_irq_handler_entry&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tp_btf/irq_handler_entry"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_PROG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;irq_handler_entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/hardirqs.bpf.c" rel="noopener noreferrer"&gt;hardirqs.bpf.c&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Reduce pre-allocation overhead
&lt;/h3&gt;

&lt;p&gt;Beginning in Linux 4.6, BPF hash maps pre-allocate memory by default, and the &lt;code&gt;BPF_F_NO_PREALLOC&lt;/code&gt; flag was introduced to opt out. The motivation was to avoid kprobe + bpf deadlocks. The community tried other solutions, but in the end, pre-allocating all the map elements was the simplest one and didn't affect the user-space visible behavior.&lt;/p&gt;

&lt;p&gt;When full map pre-allocation is too expensive in memory, define the map with the &lt;code&gt;BPF_F_NO_PREALLOC&lt;/code&gt; flag to restore the old on-demand behavior. For details, see &lt;a href="https://lore.kernel.org/patchwork/cover/656547/" rel="noopener noreferrer"&gt;bpf: map pre-alloc&lt;/a&gt;. When the map is small (such as &lt;code&gt;MAX_ENTRIES&lt;/code&gt; = 256), this flag is unnecessary, because &lt;code&gt;BPF_F_NO_PREALLOC&lt;/code&gt; makes map operations slower.&lt;/p&gt;

&lt;p&gt;Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_MAP_TYPE_HASH&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_entries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAX_ENTRIES&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;map_flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_F_NO_PREALLOC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="nf"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".maps"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see many cases in &lt;a href="https://github.com/iovisor/bcc/tree/master/libbpf-tools" rel="noopener noreferrer"&gt;libbpf-tools&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Determine the map size at runtime
&lt;/h3&gt;

&lt;p&gt;One advantage of libbpf-tools is portability: the same binary runs on different machines, so the maximum space a map requires may differ from machine to machine. In this case, you can define the map without specifying its size and resize it before the load phase. For example:&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;&amp;lt;name&amp;gt;.bpf.c&lt;/code&gt;, define the map as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_MAP_TYPE_HASH&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="nf"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".maps"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the open phase, call &lt;code&gt;bpf_map__resize()&lt;/code&gt;. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;cpudist_bpf&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cpudist_bpf__open&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;bpf_map__resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;maps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pid_max&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/cpudist.c#L223" rel="noopener noreferrer"&gt;cpudist.c&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-CPU
&lt;/h3&gt;

&lt;p&gt;When you select the map type, if multiple associated events occur on the same CPU, using a per-CPU array to track the timestamp is much simpler and more efficient than using a hashmap. However, you must be sure that the kernel doesn't migrate the process from one CPU to another between the two BPF program invocations, so you can't always use this trick. The following example analyzes soft interrupts and meets both conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_MAP_TYPE_PERCPU_ARRAY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_entries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="nf"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".maps"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tp_btf/softirq_entry"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_PROG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;softirq_entry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;vec_nr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;u64&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_ktime_get_ns&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 

        &lt;span class="n"&gt;bpf_map_update_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tp_btf/softirq_exit"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_PROG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;softirq_exit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;vec_nr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;u64&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tsp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="p"&gt;[...]&lt;/span&gt;
        &lt;span class="n"&gt;tsp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_map_lookup_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/softirqs.bpf.c" rel="noopener noreferrer"&gt;softirqs.bpf.c&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Global variables
&lt;/h2&gt;

&lt;p&gt;Not only can you use global variables to customize BPF program logic, but you can also use them instead of maps to make your program simpler and more efficient. Global variables can be of any type; you just need to declare them with a fixed size (or at least a bounded maximum size, if you don't mind wasting some memory).&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;Because the number of SOFTIRQ types is fixed, you can define global arrays to save counts and histograms in &lt;code&gt;softirq.bpf.c&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;__u64&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;NR_SOFTIRQS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt; &lt;span class="n"&gt;hists&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;NR_SOFTIRQS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, you can traverse the array directly in user space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;print_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;softirqs_bpf__bss&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nanoseconds&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s"&gt;"nsecs"&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"usecs"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;__u64&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;__u32&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%-16s %6s%5s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"SOFTIRQ"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"TOTAL_"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;NR_SOFTIRQS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__atomic_exchange_n&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;bss&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                        &lt;span class="n"&gt;__ATOMIC_RELAXED&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%-16s %11llu&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vec_names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/softirqs.c" rel="noopener noreferrer"&gt;softirqs.c&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch out for directly accessing fields through pointers
&lt;/h2&gt;

&lt;p&gt;As you know from the &lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#reading-kernel-structures-fields" rel="noopener noreferrer"&gt;BPF Portability and CO-RE&lt;/a&gt; blog post, the libbpf + &lt;code&gt;BPF_PROG_TYPE_TRACING&lt;/code&gt; approach benefits from the smarts of the BPF verifier: it understands and tracks BTF natively and lets you follow pointers and read kernel memory directly (and safely). For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;u64&lt;/span&gt; &lt;span class="n"&gt;inode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;mm&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;exe_file&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;f_inode&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;i_ino&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is very cool. However, when you use such expressions in conditional statements, a verifier bug in some kernel versions can incorrectly eliminate the branch. Until &lt;a href="https://www.spinics.net/lists/bpf/msg21897.html" rel="noopener noreferrer"&gt;bpf: fix an incorrect branch elimination by verifier&lt;/a&gt; is widely backported, please use &lt;code&gt;BPF_CORE_READ&lt;/code&gt; for kernel compatibility. You can find an example in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/biolatency.bpf.c#L63" rel="noopener noreferrer"&gt;biolatency.bpf.c&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tp_btf/block_rq_issue"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_PROG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block_rq_issue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;request_queue&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;targ_queued&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;BPF_CORE_READ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elevator&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;trace_rq_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rq&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, even though this is a &lt;code&gt;tp_btf&lt;/code&gt; program in which a direct &lt;code&gt;q-&amp;gt;elevator&lt;/code&gt; access would be faster, I have to use &lt;code&gt;BPF_CORE_READ(q, elevator)&lt;/code&gt; instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article introduced some tips for writing BPF programs with libbpf. You can find many practical examples in &lt;a href="https://github.com/iovisor/bcc/tree/master/libbpf-tools" rel="noopener noreferrer"&gt;libbpf-tools&lt;/a&gt; and &lt;a href="https://github.com/torvalds/linux/tree/master/tools/testing/selftests/bpf" rel="noopener noreferrer"&gt;the kernel's BPF selftests&lt;/a&gt;. If you have any questions, you can join the &lt;a href="https://slack.tidb.io/invite?team=tidb-community&amp;amp;channel=everyone&amp;amp;ref=pingcap-blog" rel="noopener noreferrer"&gt;TiDB community on Slack&lt;/a&gt; and send us your feedback.&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://en.pingcap.com/blog/tips-and-tricks-for-writing-linux-bpf-applications-with-libbpf/" rel="noopener noreferrer"&gt;PingCAP&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>linux</category>
    </item>
    <item>
      <title>Why We Disable Linux's THP Feature for Databases</title>
      <dc:creator>Wenbo Zhang</dc:creator>
      <pubDate>Thu, 07 Jan 2021 14:51:43 +0000</pubDate>
      <link>https://dev.to/ethercflow/why-we-disable-linux-s-thp-feature-for-databases-57</link>
      <guid>https://dev.to/ethercflow/why-we-disable-linux-s-thp-feature-for-databases-57</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fhow-thp-slows-down-your-database-performance-banner.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fhow-thp-slows-down-your-database-performance-banner.jpg" alt="Disabling THP to improve database performance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Linux's memory management system is transparent to the user. However, if you're not familiar with its working principles, you might run into unexpected performance issues. That's especially true for sophisticated software like databases. When databases run in Linux, even small system variations might impact performance.&lt;/p&gt;

&lt;p&gt;After an in-depth investigation, we found that &lt;a href="https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html" rel="noopener noreferrer"&gt;Transparent Huge Page&lt;/a&gt; (THP), a Linux memory management feature, often slows down database performance. In this post, I'll describe how THP causes performance to fluctuate, the typical symptoms, and our recommended solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is THP
&lt;/h2&gt;

&lt;p&gt;THP is an important feature of the Linux kernel. It maps page table entries to larger page sizes to reduce page faults. This improves the &lt;a href="https://en.wikipedia.org/wiki/Translation_lookaside_buffer" rel="noopener noreferrer"&gt;translation lookaside buffer&lt;/a&gt; (TLB) hit ratio. TLB is a memory cache used by the memory management unit to improve the translation speed from virtual memory addresses to physical memory addresses.&lt;/p&gt;

&lt;p&gt;When the application data being accessed is contiguous, THP often boosts performance. In contrast, if the memory access patterns are not contiguous, THP can't fulfill its duty, and it may even cause system instability.&lt;/p&gt;

&lt;p&gt;Unfortunately, database workloads are known to have sparse rather than contiguous memory access. Therefore, you should disable THP for your database.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Linux manages its memory
&lt;/h3&gt;

&lt;p&gt;To understand the harm THP can cause, let's consider how Linux manages its physical memory.&lt;/p&gt;

&lt;p&gt;The Linux kernel employs different memory-mapping approaches for different purposes: user space maps memory via multi-level page tables to save space, while kernel space uses a linear mapping for simplicity and efficiency.&lt;/p&gt;

&lt;p&gt;When the kernel starts, it adds physical pages to &lt;a href="https://en.wikipedia.org/wiki/Buddy_memory_allocation" rel="noopener noreferrer"&gt;the buddy system&lt;/a&gt;. Each time a process requests memory, the buddy system allocates the desired pages; when the process releases memory, the buddy system takes the pages back.&lt;/p&gt;

&lt;p&gt;To accommodate low-speed devices and varied workloads, Linux divides memory pages into anonymous pages and file-backed pages, and uses the page cache to cache files from low-speed devices. When memory is insufficient, the swap cache and the &lt;code&gt;swappiness&lt;/code&gt; parameter determine what proportion of the two page types is released.&lt;/p&gt;
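
&lt;p&gt;As a quick sketch, the reclaim preference between the two page types can be inspected via the &lt;code&gt;swappiness&lt;/code&gt; knob:&lt;/p&gt;

```shell
# swappiness biases reclaim between file-backed pages (page cache) and
# anonymous pages: lower values favor keeping anonymous pages in RAM
cat /proc/sys/vm/swappiness
```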

&lt;p&gt;To respond to memory requests as quickly as possible, and to keep the system running when memory is scarce, Linux defines three watermarks: &lt;code&gt;high&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, and &lt;code&gt;min&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the unused physical memory falls below &lt;code&gt;low&lt;/code&gt; but stays above &lt;code&gt;min&lt;/code&gt;, then when a user requests memory, the page replacement daemon &lt;code&gt;kswapd&lt;/code&gt; asynchronously frees memory until the available physical memory rises above &lt;code&gt;high&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If asynchronous reclaim can't keep up with the memory requests, Linux triggers synchronous direct reclaim. In that case, every thread requesting memory takes part in freeing it, and each thread obtains the memory it requested only once enough becomes available.&lt;/li&gt;
&lt;/ul&gt;
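
&lt;p&gt;The watermarks above are maintained per memory zone and can be read from &lt;code&gt;/proc/zoneinfo&lt;/code&gt;; a minimal sketch:&lt;/p&gt;

```shell
# Print the min/low/high watermarks (in pages) for each memory zone;
# kswapd wakes below "low" and direct reclaim kicks in below "min"
awk '/^Node/ {zone = $0}
     $1 == "min" || $1 == "low" || $1 == "high" {print zone ": " $1 " = " $2 " pages"}' /proc/zoneinfo
```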

&lt;p&gt;During direct reclaim, if the pages are clean, the blockage caused by synchronous reclaim is short; if they are dirty, it can add tens of milliseconds of latency and, depending on the backing devices, sometimes even seconds.&lt;/p&gt;

&lt;p&gt;Apart from the watermarks, another mechanism may also cause direct memory reclaim. Sometimes a thread requests a large block of contiguous memory pages. If there is enough physical memory but it's fragmented, the kernel performs memory compaction, which might in turn trigger direct memory reclaim.&lt;/p&gt;

&lt;p&gt;To sum up, when threads apply for memory, the major causes of latency are direct memory reclaim and memory compaction. For workloads whose memory access is not very contiguous, such as databases, THP may trigger the two tasks and thus cause fluctuating performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  When THP causes performance fluctuation
&lt;/h2&gt;

&lt;p&gt;If your system performance fluctuates, how can you be sure THP is the cause? I'd like to share three symptoms that we've found are related to THP.&lt;/p&gt;

&lt;h3&gt;
  
  
  The most typical symptom: &lt;code&gt;sys cpu&lt;/code&gt; rises
&lt;/h3&gt;

&lt;p&gt;Based on our customer support experience, the most typical symptom of THP-caused performance fluctuation is sharply rising system CPU utilization.&lt;/p&gt;

&lt;p&gt;In such cases, if you create an on-CPU flame graph using &lt;a href="https://en.wikipedia.org/wiki/Perf_(Linux)" rel="noopener noreferrer"&gt;perf&lt;/a&gt;, you'll see that all the service threads in the runnable state are performing memory compaction, and that the page fault exception handler is &lt;code&gt;do_huge_pmd_anonymous_page&lt;/code&gt;. This means the system doesn't have any contiguous 2 MB blocks of physical memory available, which triggers direct memory compaction. Direct compaction is time-consuming, so it drives up system CPU utilization.&lt;/p&gt;

&lt;h3&gt;
  
  
  The indirect symptom: &lt;code&gt;sys load&lt;/code&gt; rises
&lt;/h3&gt;

&lt;p&gt;Many memory issues are not as obvious as those described above. When the system allocates THP or other high-order memory, it doesn't always perform memory compaction directly and leave you an obvious trace. Instead, it often mixes compaction with other tasks, such as direct memory reclaim.&lt;/p&gt;

&lt;p&gt;Involving direct reclaim makes troubleshooting more confusing. For example, even when the unused physical memory in the normal zone is higher than the &lt;code&gt;high&lt;/code&gt; watermark, the system may still continuously reclaim memory. To get to the bottom of this, we need to dive deeper into the processing logic of the slow memory allocation path.&lt;/p&gt;

&lt;p&gt;The slow memory allocation breaks down into four major steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Asynchronous memory compaction&lt;/li&gt;
&lt;li&gt;Direct memory reclaim&lt;/li&gt;
&lt;li&gt;Direct memory compaction&lt;/li&gt;
&lt;li&gt;Out-of-memory (OOM) killing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After each step, the system tries to allocate memory. If the allocation succeeds, the system returns the allocated page and skips the remaining steps. For each allocation, the kernel provides a fragmentation index for each order in the buddy system, which indicates whether the allocation failure is caused by insufficient memory or by fragmented memory.&lt;/p&gt;

&lt;p&gt;The fragmentation index is associated with the &lt;code&gt;/proc/sys/vm/extfrag_threshold&lt;/code&gt; parameter. The closer the number is to 1,000, the more the allocation failure is related to memory fragmentation, and the kernel is more likely to perform memory compaction. The closer the number is to 0, the more the allocation failure is related to insufficient memory, and the kernel is more inclined to perform memory reclaim.&lt;/p&gt;
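
&lt;p&gt;You can read (and, with root privileges, adjust) this threshold directly:&lt;/p&gt;

```shell
# The compaction-vs-reclaim tradeoff knob; the default is 500
cat /proc/sys/vm/extfrag_threshold
```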

&lt;p&gt;Therefore, even when the unused memory is higher than the &lt;code&gt;high&lt;/code&gt; watermark, the system may also frequently reclaim memory. Because THP consumes high-level memory, it compounds the performance fluctuation caused by memory fragmentation.&lt;/p&gt;

&lt;p&gt;To verify whether the performance fluctuation is related to memory fragmentation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;View the direct memory reclaim operations taken per second. Execute &lt;code&gt;sar -b&lt;/code&gt; to observe &lt;code&gt;pgscand/s&lt;/code&gt;. If this number is greater than 0 for a consecutive period of time, take the following steps to troubleshoot the problem.&lt;/li&gt;
&lt;li&gt;Observe the memory fragmentation index. Execute &lt;code&gt;cat /sys/kernel/debug/extfrag/extfrag_index&lt;/code&gt; to get the index. Focus on the fragmentation index of the blocks whose order is &amp;gt;= 3. If the number is close to 1,000, the fragmentation is severe; if it's close to 0, the memory is insufficient.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;View the memory fragmentation status. Execute &lt;code&gt;cat /proc/buddyinfo&lt;/code&gt; and &lt;code&gt;cat /proc/pagetypeinfo&lt;/code&gt; to show the status. (Refer to the &lt;a href="https://man7.org/linux/man-pages/man5/proc.5.html" rel="noopener noreferrer"&gt;Linux manual page&lt;/a&gt; for details.) Focus on the number of pages whose order is &amp;gt;= 3.&lt;/p&gt;

&lt;p&gt;Compared to &lt;code&gt;buddyinfo&lt;/code&gt;, &lt;code&gt;pagetypeinfo&lt;/code&gt; displays more detailed information grouped by migration types. The buddy system implements anti-fragmentation through migration types. Note that if all the &lt;code&gt;Unmovable&lt;/code&gt; pages are grouped in order &amp;lt; 3, the kernel slab objects have severe fragmentation. In such cases, you need to troubleshoot the specific cause of the problem using other tools.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For kernels that support the &lt;a href="https://en.wikipedia.org/wiki/Berkeley_Packet_Filter" rel="noopener noreferrer"&gt;Berkeley Packet Filter&lt;/a&gt; (BPF), such as CentOS 7.6, &lt;strong&gt;you may also perform quantitative analysis on the latency&lt;/strong&gt; using &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/drsnoop_example.txt" rel="noopener noreferrer"&gt;drsnoop&lt;/a&gt; or &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/compactsnoop_example.txt" rel="noopener noreferrer"&gt;compactsnoop&lt;/a&gt; developed by PingCAP.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;(Optional) &lt;strong&gt;Trace the &lt;code&gt;mm_page_alloc_extfrag&lt;/code&gt; event with &lt;a href="https://en.wikipedia.org/wiki/Ftrace" rel="noopener noreferrer"&gt;ftrace&lt;/a&gt;&lt;/strong&gt;. Due to memory fragmentation, the migration type steals physical pages from the backup migration type.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
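
&lt;p&gt;If &lt;code&gt;sar&lt;/code&gt; isn't available, step 1 can be approximated by sampling the direct-reclaim scan counters in &lt;code&gt;/proc/vmstat&lt;/code&gt; (counter names vary slightly across kernel versions, so this matches on the prefix):&lt;/p&gt;

```shell
# Sample the direct-reclaim page-scan counters twice, one second apart;
# a persistently nonzero delta indicates ongoing direct reclaim
s1=$(awk '$1 ~ /^pgscan_direct/ {sum += $2} END {print sum + 0}' /proc/vmstat)
sleep 1
s2=$(awk '$1 ~ /^pgscan_direct/ {sum += $2} END {print sum + 0}' /proc/vmstat)
echo "pages scanned by direct reclaim in the last second: $((s2 - s1))"
```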

&lt;h3&gt;
  
  
  The atypical symptom: abnormal RES usage
&lt;/h3&gt;

&lt;p&gt;Sometimes, when a service starts on an AArch64 server, it occupies dozens of gigabytes of physical memory. Viewing the &lt;code&gt;/proc/pid/smaps&lt;/code&gt; file shows that most of that memory is used for THP. Because the CentOS 7 kernel on AArch64 uses a 64 KB page size, resident memory usage is many times larger than on the x86_64 platform.&lt;/p&gt;
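
&lt;p&gt;To quantify how much of a process's resident memory is THP-backed, you can sum the &lt;code&gt;AnonHugePages&lt;/code&gt; fields in its smaps file (shown here for the current shell; substitute the service's PID for &lt;code&gt;self&lt;/code&gt;):&lt;/p&gt;

```shell
# Sum THP-backed anonymous memory across all mappings of a process
awk '/AnonHugePages/ {sum += $2} END {print sum + 0, "kB of RSS backed by THP"}' /proc/self/smaps
```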

&lt;h2&gt;
  
  
  How to deal with THP
&lt;/h2&gt;

&lt;p&gt;For applications that are not optimized to store their data contiguously, or that have sparse memory access patterns, enabling THP and THP defrag is detrimental to long-running services.&lt;/p&gt;

&lt;p&gt;Before Linux v4.6, the kernel didn't provide the &lt;code&gt;defer&lt;/code&gt; or &lt;code&gt;defer + madvise&lt;/code&gt; options for THP defrag. Therefore, for CentOS 7, which uses the v3.10 kernel, we recommend disabling THP. If your applications do need THP, however, set THP to &lt;code&gt;madvise&lt;/code&gt;, which allocates huge pages only via the &lt;a href="https://www.man7.org/linux/man-pages/man2/madvise.2.html" rel="noopener noreferrer"&gt;madvise system call&lt;/a&gt;. Otherwise, setting THP to &lt;code&gt;never&lt;/code&gt; is the best choice for your application.&lt;/p&gt;

&lt;p&gt;To disable THP:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;View the current THP configuration:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/kernel/mm/transparent_hugepage/enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the value is &lt;code&gt;always&lt;/code&gt;, execute the following commands:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;never &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/kernel/mm/transparent_hugepage/enabled
&lt;span class="nb"&gt;echo &lt;/span&gt;never &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/kernel/mm/transparent_hugepage/defrag
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note that if you restart the server, THP might be turned on again. You can write the two commands in the &lt;code&gt;.service&lt;/code&gt; file and let &lt;a href="https://en.wikipedia.org/wiki/Systemd" rel="noopener noreferrer"&gt;systemd&lt;/a&gt; manage it for you.&lt;/p&gt;
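
&lt;p&gt;A sketch of such a unit file (the unit name and install target here are illustrative, not a standard):&lt;/p&gt;

```ini
# /etc/systemd/system/disable-thp.service (illustrative name)
[Unit]
Description=Disable Transparent Huge Pages
After=sysinit.target local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=basic.target
```

&lt;p&gt;After installing the file, run &lt;code&gt;systemctl daemon-reload&lt;/code&gt; and &lt;code&gt;systemctl enable disable-thp&lt;/code&gt; so the setting survives reboots.&lt;/p&gt;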

&lt;h2&gt;
  
  
  Join our community
&lt;/h2&gt;

&lt;p&gt;If you have any other questions about database performance tuning, or would like to share your expertise, feel free to join the &lt;a href="https://slack.tidb.io/invite?team=tidb-community&amp;amp;channel=everyone&amp;amp;ref=pingcap-blog" rel="noopener noreferrer"&gt;TiDB Community Slack&lt;/a&gt; workspace.&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://en.pingcap.com/blog/why-we-disable-linux-thp-feature-for-databases" rel="noopener noreferrer"&gt;pingcap.com&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why We Switched from bcc-tools to libbpf-tools for BPF Performance Analysis</title>
      <dc:creator>Wenbo Zhang</dc:creator>
      <pubDate>Mon, 14 Dec 2020 08:00:10 +0000</pubDate>
      <link>https://dev.to/ethercflow/why-we-switched-from-bcc-tools-to-libbpf-tools-for-bpf-performance-analysis-3f09</link>
      <guid>https://dev.to/ethercflow/why-we-switched-from-bcc-tools-to-libbpf-tools-for-bpf-performance-analysis-3f09</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oeOAEmw3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/bcc-vs-libbpf-bpf-performance-analysis.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oeOAEmw3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/bcc-vs-libbpf-bpf-performance-analysis.jpg" alt="BPF Linux, BPF performance tools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Distributed clusters might encounter performance problems or unpredictable failures, especially when they are running in the cloud. Of all the kinds of failures, kernel failures may be the most difficult to analyze and simulate. &lt;/p&gt;

&lt;p&gt;A practical solution is &lt;a href="https://en.wikipedia.org/wiki/Berkeley_Packet_Filter"&gt;Berkeley Packet Filter&lt;/a&gt; (BPF), a highly flexible, efficient virtual machine that runs in the Linux kernel. It allows bytecode to be safely executed in various hooks, which exist in a variety of Linux kernel subsystems. BPF is mainly used for networking, tracing, and security.&lt;/p&gt;

&lt;p&gt;Based on BPF, there are two development modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://github.com/iovisor/bcc"&gt;BPF Compiler Collection&lt;/a&gt; (BCC) toolkit offers many useful resources and examples to construct effective kernel tracing and manipulation programs. However, it has disadvantages. &lt;/li&gt;
&lt;li&gt;libbpf + BPF CO-RE (Compile Once – Run Everywhere) is a different development and deployment mode than the BCC framework. &lt;strong&gt;It greatly reduces storage space and runtime overhead, which enables BPF to support more hardware environments, and it optimizes programmers' development experience.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, I'll describe why libbpf-tools, a collection of applications based on the libbpf + BPF CO-RE mode, is a better solution than bcc-tools and how we're using libbpf-tools at &lt;a href="https://pingcap.com/"&gt;PingCAP&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why libbpf + BPF CO-RE is better than BCC
&lt;/h2&gt;

&lt;h3&gt;
  
  
  BCC vs. libbpf + BPF CO-RE
&lt;/h3&gt;

&lt;p&gt;BCC embeds LLVM or Clang to rewrite, compile, and load BPF programs. Although it does its best to simplify BPF developers' work, it has these drawbacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It uses the Clang front-end to modify user-written BPF programs. When a problem occurs, it's difficult to find the problem and figure out a solution. &lt;/li&gt;
&lt;li&gt;You must remember naming conventions and automatically generated tracepoint structs. &lt;/li&gt;
&lt;li&gt;Because the libbcc library contains a huge LLVM or Clang library, when you use it, you might encounter some issues:

&lt;ul&gt;
&lt;li&gt;When a tool starts, it takes many CPU and memory resources to compile the BPF program. If it runs on a server that lacks system resources, it might trigger a problem.&lt;/li&gt;
&lt;li&gt;BCC depends on kernel header packages, which you must install on each target host. If you need unexported content in the kernel, you must manually copy and paste the type definition into the BPF code.&lt;/li&gt;
&lt;li&gt;Because BPF programs are compiled during runtime, many simple compilation errors can only be detected at runtime. This affects your development experience.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By contrast, BPF CO-RE has these advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When you implement BPF CO-RE, you can directly use the libbpf library provided by kernel developers to develop BPF programs. The development method is the same as writing ordinary C user-mode programs: one compilation generates a small binary file. &lt;/li&gt;
&lt;li&gt;Libbpf acts like a BPF program loader and relocates, loads, and checks BPF programs. BPF developers only need to focus on the BPF programs' correctness and performance. &lt;/li&gt;
&lt;li&gt;This approach minimizes overhead and removes huge dependencies, which makes the overall development process smoother.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more details, see &lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/20/bcc-to-libbpf-howto-guide.html#why-libbpf-and-bpf-co-re"&gt;Why libbpf and BPF CO-RE?&lt;/a&gt;.&lt;/p&gt;
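
&lt;p&gt;One visible consequence of this design: a CO-RE tool ships no compiler at all and instead relies on the running kernel's self-describing BTF type information, which kernels built with &lt;code&gt;CONFIG_DEBUG_INFO_BTF&lt;/code&gt; (v5.4 and later) export here:&lt;/p&gt;

```shell
# libbpf reads this type info at load time to relocate field offsets
# for whatever kernel the tool happens to run on
ls -lh /sys/kernel/btf/vmlinux
```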

&lt;h3&gt;
  
  
  Performance comparison
&lt;/h3&gt;

&lt;p&gt;Performance optimization master &lt;a href="https://github.com/brendangregg"&gt;Brendan Gregg&lt;/a&gt; used libbpf + BPF CO-RE to convert a BCC tool and compared their performance data. &lt;a href="https://github.com/iovisor/bcc/pull/2778#issuecomment-594202408"&gt;He said&lt;/a&gt;: "As my colleague Jason pointed out, &lt;strong&gt;the memory footprint of opensnoop as CO-RE is much lower than opensnoop.py&lt;/strong&gt;. &lt;strong&gt;9 Mbytes for CO-RE vs 80 Mbytes for Python&lt;/strong&gt;."&lt;/p&gt;

&lt;p&gt;According to his research, libbpf + BPF CO-RE reduced runtime memory overhead nearly ninefold compared with BCC, which greatly benefits servers with scarce physical memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we're using libbpf-tools at PingCAP
&lt;/h2&gt;

&lt;p&gt;At PingCAP, we've been following BPF and its community development for a long time. In the past, every time we added a new machine, we had to install a set of BCC dependencies on it, which was troublesome. After &lt;a href="https://github.com/anakryiko"&gt;Andrii Nakryiko&lt;/a&gt; (the libbpf + BPF CO-RE project's leader) added the first libbpf-tools to the BCC project, we did our research and switched from bcc-tools to libbpf-tools. Fortunately, during the switch, we got guidance from him, Brendan, and &lt;a href="https://github.com/yonghong-song"&gt;Yonghong Song&lt;/a&gt; (the BTF project's leader). We've converted 18 BCC or bpftrace tools to &lt;a href="https://github.com/iovisor/bcc/tree/master/libbpf-tools"&gt;libbpf + BPF CO-RE&lt;/a&gt;, and we're using them in our company. &lt;/p&gt;

&lt;p&gt;For example, when we analyzed the I/O performance of a specific workload, we used multiple performance analysis tools at the block layer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Task&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Performance analysis tool&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Check I/O requests' latency distribution
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/biolatency.bpf.c"&gt;./biolatency -d nvme0n1&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Analyze I/O mode
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/biopattern.bpf.c"&gt;./biopattern -T 1 -d 259:0&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Check the request size distribution diagram when the task sent physical I/O requests
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/bitesize.bpf.c"&gt;./bitesize -c fio -T&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Analyze each physical I/O
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/biosnoop.bpf.c"&gt;./biosnoop -d nvme0n1&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The analysis results helped us optimize I/O performance. We're also exploring whether the scheduler-related libbpf-tools are helpful for tuning the &lt;a href="https://docs.pingcap.com/tidb/stable/"&gt;TiDB database&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;These tools are universal: feel free to give them a try. In the future, we'll implement more tools based on libbpf-tools. If you'd like to learn more about our experience with these tools, you can join the &lt;a href="https://slack.tidb.io/invite?team=tidb-community&amp;amp;channel=everyone&amp;amp;ref=pingcap-blog"&gt;TiDB community on Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This article was originally published at &lt;a href="https://en.pingcap.com/blog/why-we-switched-from-bcc-tools-to-libbpf-tools-for-bpf-performance-analysis"&gt;pingcap.com&lt;/a&gt; on December 3, 2020.&lt;/p&gt;

</description>
      <category>linux</category>
    </item>
  </channel>
</rss>
