<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Wenbo Zhang</title>
    <description>The latest articles on DEV Community by Wenbo Zhang (@ethercflow).</description>
    <link>https://dev.to/ethercflow</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F540018%2F1b579418-90aa-4f67-a5ba-3a864a3ec784.png</url>
      <title>DEV Community: Wenbo Zhang</title>
      <link>https://dev.to/ethercflow</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ethercflow"/>
    <language>en</language>
    <item>
      <title>Linux Kernel vs. Memory Fragmentation (Part II)</title>
      <dc:creator>Wenbo Zhang</dc:creator>
      <pubDate>Wed, 26 May 2021 05:52:28 +0000</pubDate>
      <link>https://dev.to/ethercflow/linux-kernel-vs-memory-fragmentation-part-ii-6mg</link>
      <guid>https://dev.to/ethercflow/linux-kernel-vs-memory-fragmentation-part-ii-6mg</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EQeNHjgZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/linux-memory-fragmentation-and-defragmentation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EQeNHjgZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/linux-memory-fragmentation-and-defragmentation.png" alt="Linux kernel memory fragmentation and defragmentation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-1"&gt;Linux Kernel vs. Memory Fragmentation (Part I)&lt;/a&gt;, I concluded that grouping pages by migration type only delays memory fragmentation; it does not fundamentally solve it. As fragmentation increases and the system runs short of contiguous physical memory, performance degrades.&lt;/p&gt;

&lt;p&gt;Therefore, to mitigate the performance degradation, the Linux kernel community introduced &lt;strong&gt;memory compaction&lt;/strong&gt; to the kernel.&lt;/p&gt;

&lt;p&gt;In this post, I'll explain the principle of memory compaction, how to view the fragmentation index, and how to quantify the latency overheads caused by memory compaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory compaction
&lt;/h2&gt;

&lt;p&gt;Before memory compaction, the kernel used lumpy reclaim for defragmentation. However, this feature was removed from v3.10 (currently the most widely used kernel version). If you'd like to learn more, you can read about lumpy reclaim in the articles I listed in &lt;a href="https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-1#a-brief-history-of-defragmentation"&gt;A brief history of defragmentation&lt;/a&gt;. For now, let's turn to memory compaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Algorithm introduction
&lt;/h3&gt;

&lt;p&gt;The article &lt;a href="https://lwn.net/Articles/368869/"&gt;Memory compaction&lt;/a&gt; on LWN.net explains the algorithmic idea of memory compaction in detail. You can take the following fragmented zone as a simple example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G_fB8HNC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/a-small-fragmented-memory-zone.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G_fB8HNC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/a-small-fragmented-memory-zone.png" alt="A small fragmented memory zone"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A small fragmented memory zone - LWN.net &lt;/p&gt;

&lt;p&gt;The white boxes are free pages, while those in red are allocated pages.&lt;/p&gt;

&lt;p&gt;Memory compaction for this zone breaks down into three major steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Scan this zone from left to right for red pages of the MIGRATE_MOVABLE migration type.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rjxHISN_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/linux-kernel-movable-pages.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rjxHISN_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/linux-kernel-movable-pages.png" alt="Search for movable pages"&gt;&lt;/a&gt;&lt;/p&gt;

 Search for movable pages &lt;/li&gt;
&lt;li&gt;
&lt;p&gt;At the same time, scan this zone from right to left for free pages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dPtWrv1H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/linux-kernel-free-pages.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dPtWrv1H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/linux-kernel-free-pages.png" alt="Search for free pages"&gt;&lt;/a&gt;&lt;/p&gt;

 Search for free pages &lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Shift movable pages at the bottom to free pages at the top, thus creating a contiguous chunk of free space.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g3Vv8O9o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/a-small-memory-zone-after-memory-compaction.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g3Vv8O9o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/a-small-memory-zone-after-memory-compaction.png" alt="The small memory zone after memory compaction"&gt;&lt;/a&gt;&lt;/p&gt;

 The memory zone after memory compaction &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This principle seems relatively simple, and the kernel also provides &lt;code&gt;/proc/sys/vm/compact_memory&lt;/code&gt; as the interface for manually triggering memory compaction.&lt;/p&gt;
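&lt;p&gt;For example, the following one-liner compacts memory system-wide. It is an illustrative invocation: it requires root, a Linux host, and a kernel built with &lt;code&gt;CONFIG_COMPACTION&lt;/code&gt;:&lt;/p&gt;

```shell
# Trigger compaction of all zones system-wide (requires root):
echo 1 > /proc/sys/vm/compact_memory
```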

&lt;p&gt;However, as mentioned in &lt;a href="https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-1"&gt;Part I&lt;/a&gt; and &lt;a href="https://lwn.net/Articles/591998/"&gt;Memory compaction issues&lt;/a&gt;, memory compaction is not very efficient in practice—at least for the most commonly-used kernel, v3.10—no matter whether it is triggered automatically or manually. Due to the high overhead it causes, it becomes a performance bottleneck instead.&lt;/p&gt;

&lt;p&gt;The open source community did not abandon this feature but continued to optimize it in subsequent versions. For example, the community &lt;a href="https://github.com/torvalds/linux/commit/698b1b3064"&gt;introduced kcompactd&lt;/a&gt; to the kernel in v4.6 and &lt;a href="https://lwn.net/Articles/686801/"&gt;made direct compaction more deterministic&lt;/a&gt; in v4.8.&lt;/p&gt;

&lt;h3&gt;
  
  
  When memory compaction is performed
&lt;/h3&gt;

&lt;p&gt;In kernel v3.10, memory compaction is performed under any of the following situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;kswapd&lt;/code&gt; kernel thread is called to balance zones after a failed high-order allocation.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;khugepaged&lt;/code&gt; kernel thread is called to collapse a huge page.&lt;/li&gt;
&lt;li&gt;Memory compaction is manually triggered via the &lt;code&gt;/proc&lt;/code&gt; interface.&lt;/li&gt;
&lt;li&gt;The system performs direct reclaim to meet higher-order memory requirements, including handling Transparent Huge Page (THP) page fault exceptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;a href="https://pingcap.com/blog/why-we-disable-linux-thp-feature-for-databases"&gt;Why We Disable Linux's THP Feature for Databases&lt;/a&gt;, I described how THP slows down performance and recommended disabling this feature. I will put it aside in this article and mainly focus on the memory allocation path.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QeLICvVl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/memory-allocation-in-the-slow-path.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QeLICvVl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/memory-allocation-in-the-slow-path.png" alt="Memory allocation in the slow path"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Memory allocation in the slow path &lt;/p&gt;

&lt;p&gt;When the kernel allocates pages, if there are no available pages in the free lists of the buddy system, the following occurs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The kernel processes this request in the slow path and tries to allocate pages using the low watermark as the threshold.&lt;/li&gt;
&lt;li&gt;If the memory allocation fails, which indicates that the memory may be slightly insufficient, the page allocator wakes up the &lt;code&gt;kswapd&lt;/code&gt; thread to asynchronously reclaim pages and attempts to allocate pages again, also using the low watermark as the threshold.&lt;/li&gt;
&lt;li&gt;If the allocation fails again, it means that the memory shortage is severe. In this case, the kernel runs asynchronous memory compaction first.&lt;/li&gt;
&lt;li&gt;If the allocation still does not succeed after the async memory compaction, the kernel directly reclaims memory.&lt;/li&gt;
&lt;li&gt;After the direct memory reclaim, if the kernel still hasn't reclaimed enough pages to meet the demand, it performs direct memory compaction. If it fails to reclaim even a single page, the OOM killer is invoked to free memory by killing processes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The above steps are only a simplified description of the actual workflow. In practice, the workflow is more complicated and varies with the requested memory order and the allocation flags.&lt;/p&gt;

&lt;p&gt;Direct memory reclaim is not only performed when memory is severely insufficient; in practice, it is also triggered by memory fragmentation. During a given period, the two situations may even occur simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to analyze memory compaction
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Quantify the performance latency
&lt;/h4&gt;

&lt;p&gt;As mentioned in the previous section, the kernel may perform memory reclaim or memory compaction when allocating memory. To make it easier to quantify the latency caused by direct memory reclaim and memory compaction for each participating thread, I committed two tools, &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/drsnoop_example.txt"&gt;drsnoop&lt;/a&gt; and &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/compactsnoop_example.txt"&gt;compactsnoop&lt;/a&gt;, to the &lt;a href="https://github.com/iovisor/bcc"&gt;BCC&lt;/a&gt; project.&lt;/p&gt;
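&lt;p&gt;Both tools ship with the BCC tool collection; a typical session looks like this. The invocations are illustrative only: they must run as root on a host with BCC installed, the install path varies by distribution, and each tool supports additional flags described in its linked documentation:&lt;/p&gt;

```shell
# Trace direct-memory-reclaim latency per process until Ctrl-C:
sudo /usr/share/bcc/tools/drsnoop

# Trace memory-compaction latency per process until Ctrl-C:
sudo /usr/share/bcc/tools/compactsnoop
```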

&lt;p&gt;Both tools are based on kernel events and come with detailed documentation, but there is one thing I want to note: to reduce the cost of introducing Berkeley Packet Filters (BPF), these two tools capture the latency of each corresponding event. Therefore, you may see from the output that each memory request corresponds to multiple latency results.&lt;/p&gt;

&lt;p&gt;The reason for the many-to-one relationship is that, for older kernels like v3.10, it is uncertain how many times the kernel will retry allocation during a slow-path memory allocation. This uncertainty also makes the OOM killer start working either too early or too late, leaving most tasks on the server hung for a long time.&lt;/p&gt;

&lt;p&gt;After the kernel merged the patch &lt;a href="https://github.com/torvalds/linux/commit/c73322d0"&gt;mm: fix 100% CPU kswapd busyloop on unreclaimable nodes&lt;/a&gt; in v4.12, the maximum number of direct memory reclaim attempts is limited to 16. Let's assume that the average latency of a direct memory reclaim is 10 ms. (Shrinking the active or inactive LRU lists is time consuming on today's servers with several hundred gigabytes of RAM, and there is additional delay if the server needs to wait for dirty pages to be written back.)&lt;/p&gt;

&lt;p&gt;If a thread asks the page allocator for pages and gets enough memory after only one direct memory reclaim, the latency of this allocation increases by 10 ms. If the kernel tries 16 times before reclaiming enough memory, the added latency is 160 ms instead of 10 ms, which may severely degrade performance.&lt;/p&gt;
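&lt;p&gt;The arithmetic is trivial but worth making explicit. Note the 10 ms figure is an assumed average, not a measured constant:&lt;/p&gt;

```shell
avg_reclaim_ms=10   # assumed average latency of one direct reclaim
max_retries=16      # cap on direct-reclaim attempts since v4.12
echo "best case:  $(( avg_reclaim_ms * 1 )) ms added"            # 10 ms
echo "worst case: $(( avg_reclaim_ms * max_retries )) ms added"  # 160 ms
```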

&lt;h4&gt;
  
  
  View the fragmentation index
&lt;/h4&gt;

&lt;p&gt;Let's come back to memory compaction. There are four main steps for the core logic of memory compaction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Determine whether a memory zone is suitable for memory compaction.&lt;/li&gt;
&lt;li&gt;Set the starting page frame number for scanning.&lt;/li&gt;
&lt;li&gt;Isolate pages of the MIGRATE_MOVABLE type.&lt;/li&gt;
&lt;li&gt;Migrate pages of the MIGRATE_MOVABLE type to the top of the zone.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the zone still needs compaction after one migration, the kernel loops through the above process three or four times until the compaction finishes. This operation consumes a lot of CPU resources, so monitoring often shows system CPU usage pegged at 100%.&lt;/p&gt;

&lt;p&gt;Well then, how does the kernel determine whether a zone is suitable for memory compaction?&lt;/p&gt;

&lt;p&gt;If you use the &lt;code&gt;/proc/sys/vm/compact_memory&lt;/code&gt; interface to forcibly require memory compaction for a zone, there is no need for the kernel to determine it.&lt;/p&gt;

&lt;p&gt;If memory compaction is automatically triggered, the kernel calculates the fragmentation index of the requested order to determine whether the zone has enough memory left for compaction. The closer the index is to 0, the more likely the allocation is to fail due to insufficient memory, which means memory reclaim is more suitable than memory compaction. The closer the index is to 1,000, the more likely the allocation is to fail due to excessive external fragmentation; in this situation, it is appropriate to do memory compaction, not memory reclaim.&lt;/p&gt;
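&lt;p&gt;The index can be sketched as follows. This mirrors the calculation in &lt;code&gt;__fragmentation_index()&lt;/code&gt; in &lt;code&gt;mm/vmstat.c&lt;/code&gt;; the function and argument names here are my own simplification:&lt;/p&gt;

```shell
# Sketch of the kernel's fragmentation-index calculation for one zone.
# Arguments: requested order, total free pages, total free blocks,
# and the number of free blocks already large enough for the request.
frag_index() {
  local order=$1 free_pages=$2 free_blocks_total=$3 free_blocks_suitable=$4
  local requested=$(( 2 ** order ))
  # No free blocks at all: allocation fails purely for lack of memory.
  if [ "$free_blocks_total" -eq 0 ]; then echo 0; return; fi
  # A suitably large free block exists: the request would not fail.
  if [ "$free_blocks_suitable" -gt 0 ]; then echo -1000; return; fi
  # 0 means "not enough memory"; 1000 means "enough memory, too fragmented".
  echo $(( 1000 - (1000 + free_pages * 1000 / requested) / free_blocks_total ))
}

# 16 free pages scattered as 16 order-0 blocks, order-4 request:
frag_index 4 16 16 0    # prints 875: fragmentation, not shortage, is the problem
```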

&lt;p&gt;Whether the kernel chooses to perform memory compaction or memory reclaim is determined by the external fragmentation threshold. You can view this threshold through the &lt;code&gt;/proc/sys/vm/extfrag_threshold&lt;/code&gt; interface.&lt;/p&gt;
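&lt;p&gt;For example (requires a Linux host; 500 is the usual default, i.e. 0.5 on the divided-by-1,000 scale):&lt;/p&gt;

```shell
# Read the external fragmentation threshold:
cat /proc/sys/vm/extfrag_threshold
```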

&lt;p&gt;You can view the fragmentation index directly by executing &lt;code&gt;cat /sys/kernel/debug/extfrag/extfrag_index&lt;/code&gt;. Note that the values in the following screenshot are the index divided by 1,000:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R_Hfd7eY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/fragment-index-command.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R_Hfd7eY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/media/fragment-index-command.png" alt="Linux  raw `/sys/kernel/debug/extfrag/extfrag_index` endraw  command"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros and cons
&lt;/h4&gt;

&lt;p&gt;Both the monitoring interfaces based on the &lt;code&gt;/proc&lt;/code&gt; file system and the tools based on kernel events (&lt;a href="https://github.com/iovisor/bcc/blob/master/tools/drsnoop_example.txt"&gt;drsnoop&lt;/a&gt; and &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/compactsnoop_example.txt"&gt;compactsnoop&lt;/a&gt;) can be used to analyze memory compaction, but with different pros and cons.&lt;/p&gt;

&lt;p&gt;The monitoring interfaces are simple to use, but they cannot quantify the latency, and their sampling period is long. The kernel-event-based tools solve these problems, but using them requires some understanding of how the related kernel subsystems work, and they impose certain requirements on the kernel version.&lt;/p&gt;

&lt;p&gt;Therefore, the monitoring interfaces and the kernel-events-based tools actually complement each other. Using them together can help you to analyze memory compaction thoroughly.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to mitigate memory fragmentation
&lt;/h3&gt;

&lt;p&gt;The kernel is designed to take care of slow backend devices. For example, it implements the second chance method and the refault distance based on the LRU algorithm and does not support limiting the percentage of &lt;code&gt;page cache&lt;/code&gt;. Some companies used to customize their own kernel to limit the &lt;code&gt;page cache&lt;/code&gt; and tried to submit it to the upstream kernel community, but the community did not accept it. I think it may be because this feature causes problems such as working set refaults.&lt;/p&gt;

&lt;p&gt;Therefore, to reduce the frequency of direct memory reclaim and mitigate fragmentation issues, it is a good choice to increase &lt;code&gt;vm.min_free_kbytes&lt;/code&gt; (up to 5% of the total memory). This indirectly limits the percentage of &lt;code&gt;page cache&lt;/code&gt; in scenarios with heavy I/O on machines with more than 100 GB of memory.&lt;/p&gt;

&lt;p&gt;Although setting &lt;code&gt;vm.min_free_kbytes&lt;/code&gt; to a bigger value wastes some memory, the waste is negligible. For example, if a server has 256 GB of memory and you set &lt;code&gt;vm.min_free_kbytes&lt;/code&gt; to the equivalent of 4 GB, it only takes about 1.5% of the total memory space.&lt;/p&gt;
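&lt;p&gt;A quick sanity check of those numbers (&lt;code&gt;vm.min_free_kbytes&lt;/code&gt; is expressed in kB; the &lt;code&gt;sysctl&lt;/code&gt; line at the end is illustrative and requires root):&lt;/p&gt;

```shell
min_free_kb=$(( 4 * 1024 * 1024 ))    # 4 GB in kB: 4194304
total_kb=$(( 256 * 1024 * 1024 ))     # 256 GB in kB
# Share of total memory, in hundredths of a percent: prints 156, i.e. ~1.56%
echo $(( min_free_kb * 10000 / total_kb ))
# To apply: sysctl -w vm.min_free_kbytes=$min_free_kb
```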

&lt;p&gt;The community apparently noticed the waste of memory as well, so the kernel merged the patch &lt;a href="http://lkml.iu.edu/hypermail/linux/kernel/1602.3/02009.html"&gt;mm: scale kswapd watermarks in proportion to memory&lt;/a&gt; in v4.6 to optimize it.&lt;/p&gt;

&lt;p&gt;Another solution is to drop the page cache at the right time, but this may cause more jitter in application performance.&lt;/p&gt;
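&lt;p&gt;If you do go this route, one common pattern is the following (requires root; the value written selects what to drop):&lt;/p&gt;

```shell
sync                                # write dirty pages back first
echo 1 > /proc/sys/vm/drop_caches   # 1 = page cache, 2 = dentries+inodes, 3 = both
```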

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-1"&gt;Part I&lt;/a&gt; of this post series, I briefly explained why the external fragmentation affects performance and introduced the efforts the community has made over the years in defragmentation. Here in Part II, I've focused on the defragmentation principles in the kernel v3.10 and how to observe memory fragmentation quantitatively and qualitatively.&lt;/p&gt;

&lt;p&gt;I hope this post series will be helpful for you! If you have any other thoughts about Linux memory management, welcome to join the &lt;a href="https://slack.tidb.io/invite?team=tidb-community&amp;amp;channel=everyone&amp;amp;ref=pingcap-blog"&gt;TiDB Community Slack&lt;/a&gt; workspace to share and discuss with us.&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-2"&gt;Linux Kernel vs. Memory Fragmentation (Part II)&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Linux Kernel vs. Memory Fragmentation (Part I)</title>
      <dc:creator>Wenbo Zhang</dc:creator>
      <pubDate>Thu, 01 Apr 2021 06:31:13 +0000</pubDate>
      <link>https://dev.to/ethercflow/linux-kernel-vs-memory-fragmentation-part-i-4122</link>
      <guid>https://dev.to/ethercflow/linux-kernel-vs-memory-fragmentation-part-i-4122</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Flinux-memory-fragmentation-and-defragmentation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Flinux-memory-fragmentation-and-defragmentation.png" alt="Linux kernel memory fragmentation and defragmentation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(External) memory fragmentation is a long-standing Linux kernel programming issue. As the system runs, it assigns various tasks to memory pages. Over time, memory gets fragmented, and eventually, a busy system that is up for a long time may have only a few contiguous physical pages.&lt;/p&gt;

&lt;p&gt;Because the Linux kernel supports virtual memory management, physical memory fragmentation is often not an issue. With page tables, unless large pages are used, physically scattered memory is still contiguous in the virtual address space.&lt;/p&gt;

&lt;p&gt;However, it becomes very difficult to allocate contiguous physical memory from the kernel linear mapping area. For example, it is challenging to allocate structure objects through the block allocator—a common and frequent operation in the kernel mode—or operate on a Direct Memory Access (DMA) buffer that does not support the scatter and gather modes. Such operations might cause frequent direct memory reclamation or compaction, resulting in large fluctuations in system performance, or allocation failure. In slow memory allocation paths, different operations are performed according to the page allocation flag.&lt;/p&gt;

&lt;p&gt;If the kernel programming no longer relies on the high-order physical memory allocation in the linear address space, the memory fragmentation issue will be solved. However, for a huge project like the Linux kernel, it isn't practical to make such changes.&lt;/p&gt;

&lt;p&gt;Since Linux 2.x, the open source community has tried several methods to alleviate the memory fragmentation issue, including many effective, but unusual patches. Some merged patches have been controversial, such as the memory compaction mechanism. At the &lt;a href="https://lwn.net/Articles/591998/" rel="noopener noreferrer"&gt;LSFMM 2014&lt;/a&gt; conference, many developers complained that memory compaction was not very efficient and that bugs were not easy to reproduce. But the community did not abandon the feature and continued to optimize it in subsequent versions.&lt;/p&gt;

&lt;p&gt;Mel Gorman is the most persistent contributor in this field. He has submitted two sets of important patches. The first set was merged in Linux 2.6.24 and iterated over 28 versions before the community accepted it. The second set was merged in Linux 5.0 and successfully reduced memory fragmentation events by 94% on one- or two-socket machines.&lt;/p&gt;

&lt;p&gt;In this post, I'll introduce some common extensions to the &lt;a href="https://en.wikipedia.org/wiki/Buddy_memory_allocation" rel="noopener noreferrer"&gt;buddy allocator&lt;/a&gt; that help prevent memory fragmentation in the Linux 3.10 kernel, the principle of memory compaction, how to view the fragmentation index, and how to quantify the latency overheads caused by memory compaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  A brief history of defragmentation
&lt;/h2&gt;

&lt;p&gt;Before I start, I want to recommend some good reads. The following articles show you all the efforts of improving high-level memory allocation during Linux kernel development.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;Publish date
   &lt;/td&gt;
   &lt;td&gt;Articles on LWN.net
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2004-09-08
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/101230/" rel="noopener noreferrer"&gt;Kswapd and high-order allocations&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2004-05-10
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/105021/" rel="noopener noreferrer"&gt;Active memory defragmentation&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2005-02-01
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/121618/" rel="noopener noreferrer"&gt;Yet another approach to memory fragmentation&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2005-11-02
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/158211/" rel="noopener noreferrer"&gt;Fragmentation avoidance&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2005-11-08
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/159110/" rel="noopener noreferrer"&gt;More on fragmentation avoidance&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2006-11-28
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/211505/" rel="noopener noreferrer"&gt;Avoiding - and fixing - memory fragmentation&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2010-01-06
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/368869/" rel="noopener noreferrer"&gt;Memory compaction&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2014-03-26
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/591998/" rel="noopener noreferrer"&gt;Memory compaction issues&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2015-07-14
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/650917/" rel="noopener noreferrer"&gt;Making kernel pages movable&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2016-04-23
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/684611/" rel="noopener noreferrer"&gt;CMA and compaction&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2016-05-10
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/686801/" rel="noopener noreferrer"&gt;Make direct compaction more deterministic&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2017-03-21
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/717656/" rel="noopener noreferrer"&gt;Proactive compaction&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2018-10-31
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/770235/" rel="noopener noreferrer"&gt;Fragmentation avoidance improvements&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2020-04-21
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://lwn.net/Articles/817905/" rel="noopener noreferrer"&gt;Proactive compaction for the kernel&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now, let's get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linux buddy memory allocator
&lt;/h2&gt;

&lt;p&gt;Linux uses the &lt;a href="https://en.wikipedia.org/wiki/Buddy_memory_allocation" rel="noopener noreferrer"&gt;buddy algorithm&lt;/a&gt; as a page allocator, which is simple and efficient. Linux has made some extensions to the classic algorithm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partitions' buddy allocator&lt;/li&gt;
&lt;li&gt;Per-CPU pageset&lt;/li&gt;
&lt;li&gt;Group by migration types&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Linux kernel uses node, zone, and page to describe physical memory. The partitions' buddy allocator focuses on a certain zone on a certain node.&lt;/p&gt;

&lt;p&gt;Before version 4.8, the Linux kernel implemented its page reclaim strategy per zone, because the early design mainly targeted 32-bit processors, which had a lot of high memory. However, pages in different zones on the same node aged at inconsistent rates, which caused many problems.&lt;/p&gt;

&lt;p&gt;Over a long period, the community added a lot of tricky patches, but the problem remained. With more 64-bit processors and large-memory machines in use in recent years, Mel Gorman migrated the page reclaim strategy from the zone level to the node level, which solved this problem. If you use Berkeley Packet Filter (BPF) tools to observe reclaim operations, you need to know this.&lt;/p&gt;

&lt;p&gt;The per-CPU pageset optimizes single page allocation, which reduces lock contention between processors. It has nothing to do with defragmentation.&lt;/p&gt;

&lt;p&gt;Grouping by migration types is the defragmentation method I'll introduce in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Group by migration types
&lt;/h2&gt;

&lt;p&gt;First, you need to understand the memory address space layout. Each processor architecture has a definition. For example, the definition of x86_64 is in &lt;a href="https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt" rel="noopener noreferrer"&gt;mm.txt&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Because the virtual address and physical address are not linearly mapped, accessing the virtual address space through the page table (such as the heap memory requirement of the user space) does not require contiguous physical memory. Take the &lt;a href="https://en.wikipedia.org/wiki/Intel_5-level_paging" rel="noopener noreferrer"&gt;Intel 5-level page table&lt;/a&gt; in the following figure as an example. The virtual address is divided from low to high:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low: Page offset&lt;/li&gt;
&lt;li&gt;Level 1: Direct page table index&lt;/li&gt;
&lt;li&gt;Level 2: Page middle directory index&lt;/li&gt;
&lt;li&gt;Level 3: Page upper directory index&lt;/li&gt;
&lt;li&gt;Level 4: Page 4-level directory index&lt;/li&gt;
&lt;li&gt;Level 5: Page global directory index&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fintel-5-level-paging.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fintel-5-level-paging.png" alt="Intel 5-level paging - Virtual address"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Intel 5-level paging - Wikipedia &lt;/p&gt;

&lt;p&gt;The page frame number of the physical memory is stored in the direct page table entry, and you can find it through the direct page table index. &lt;strong&gt;The physical address is the combination of the found page frame number and the page offset.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Suppose you want to change the corresponding physical page in a direct page table entry. You only need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Allocate a new page.&lt;/li&gt;
&lt;li&gt;Copy the data of the old page to the new one.&lt;/li&gt;
&lt;li&gt;Modify the value of the direct page table entry to the new page frame number.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These operations do not change the original virtual address, and you can migrate such pages at will.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For the linear mapping area, the virtual address equals the physical address plus the constant.&lt;/strong&gt; Modifying the physical address changes the virtual address, and accessing the original virtual address causes a bug. Therefore, it is not recommended to migrate these pages.&lt;/p&gt;

&lt;p&gt;When the physical pages accessed through the page table and the pages accessed through linear mapping are mixed and managed together, memory fragmentation is prone to occur. Therefore, &lt;strong&gt;the kernel defines several migration types based on the mobility of the pages and groups the pages by the migration types for defragmentation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Among the defined migration types, the three most frequently used are: &lt;strong&gt;MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, and MIGRATE_RECLAIMABLE&lt;/strong&gt;. Other migration types have special purposes, which I won't describe here.&lt;/p&gt;

&lt;p&gt;You can view the distribution of each migration type at each stage through &lt;code&gt;/proc/pagetypeinfo&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fpagetypeinfo-command.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fpagetypeinfo-command.png" alt="Linux  raw `/proc/pagetypeinfo` endraw  command"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When you allocate a page, the page allocation flag you use determines the migration type from which the page is allocated.&lt;/strong&gt; For example, you can use &lt;code&gt;__GFP_MOVABLE&lt;/code&gt; for user space memory, and &lt;code&gt;__GFP_RECLAIMABLE&lt;/code&gt; for file pages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When pages of a certain migration type are used up, the kernel steals physical pages from other migration types.&lt;/strong&gt; To avoid fragmentation, the page stealing starts from the largest page block. The page block size is determined by &lt;code&gt;pageblock_order&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fallback priorities of the above three migration types, from high to low, are:&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MIGRATE_UNMOVABLE:    MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE
MIGRATE_RECLAIMABLE:  MIGRATE_UNMOVABLE, MIGRATE_MOVABLE
MIGRATE_MOVABLE:      MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The kernel introduces grouping by migration types for defragmentation. But frequent page stealing indicates that there are external memory fragmentation events, and they might cause trouble in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analyze external memory fragmentation events
&lt;/h2&gt;

&lt;p&gt;My previous article &lt;a href="https://en.pingcap.com/blog/why-we-disable-linux-thp-feature-for-databases" rel="noopener noreferrer"&gt;Why We Disable Linux's THP Feature for Databases&lt;/a&gt; mentioned that you can use ftrace events provided by the kernel to analyze external memory fragmentation events. The procedure is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Enable the ftrace events:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;1&amp;gt; /sys/kernel/debug/tracing/events/kmem/mm_page_alloc_extfrag/enable
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Start collecting the ftrace events:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/kernel/debug/tracing/trace_pipe&amp;gt; ~/extfrag.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Press Ctrl-C to stop collecting. An event contains many fields:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Flinux-ftrace-events-fields.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Flinux-ftrace-events-fields.png" alt="Linux ftrace events' fields"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To analyze the number of external memory fragmentation events, focus on &lt;strong&gt;the events with &lt;code&gt;fallback_order &amp;lt; pageblock_order&lt;/code&gt;&lt;/strong&gt;. In the x86_64 environment, &lt;code&gt;pageblock_order&lt;/code&gt; is 9.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Clean up the events:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;0&amp;gt; /sys/kernel/debug/tracing/events/kmem/mm_page_alloc_extfrag/enable
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can see that grouping by migration types only delays memory fragmentation, but does not fundamentally solve it.&lt;/p&gt;

&lt;p&gt;As memory fragmentation increases and the system runs short of contiguous physical memory, performance degrades. So it's not enough to rely on this feature alone.&lt;/p&gt;

&lt;p&gt;In my next article, I'll introduce more methods that the kernel uses to regulate memory fragmentation.&lt;/p&gt;

&lt;p&gt;To be continued…&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-1/" rel="noopener noreferrer"&gt;Linux Kernel vs. Memory Fragmentation (Part I)&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Trace Linux System Calls in Production with Minimal Impact on Performance</title>
      <dc:creator>Wenbo Zhang</dc:creator>
      <pubDate>Thu, 21 Jan 2021 15:44:56 +0000</pubDate>
      <link>https://dev.to/ethercflow/how-to-trace-linux-system-calls-in-production-with-minimal-impact-on-performance-ndh</link>
      <guid>https://dev.to/ethercflow/how-to-trace-linux-system-calls-in-production-with-minimal-impact-on-performance-ndh</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fhow-to-trace-linux-syscalls.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fhow-to-trace-linux-syscalls.jpg" alt="How to trace Linux System Calls in Production with Minimal Impact on Performance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you need to dynamically trace Linux process system calls, you might first consider strace. strace is simple to use and works well for issues such as "Why can't the software run on this machine?" However, if you're running a trace in a production environment, strace is NOT a good choice. It introduces a substantial amount of overhead. According to &lt;a href="http://vger.kernel.org/~acme/perf/linuxdev-br-2018-perf-trace-eBPF/#/4/2" rel="noopener noreferrer"&gt;a performance test&lt;/a&gt; conducted by Arnaldo Carvalho de Melo, a senior software engineer at Red Hat, &lt;strong&gt;the process traced using strace ran 173 times slower, which is disastrous for a production environment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So are there any tools that excel at tracing system calls in a production environment? The answer is YES. This blog post introduces perf and traceloop, two commonly used command-line tools, to help you trace system calls in a production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  perf, a performance profiler for Linux
&lt;/h2&gt;

&lt;p&gt;perf is a powerful Linux profiling tool, refined and upgraded by Linux kernel developers. In addition to common features such as analyzing Performance Monitoring Unit (PMU) hardware events and kernel events, perf has the following subcomponents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sched: Analyzes scheduler actions and latencies.&lt;/li&gt;
&lt;li&gt;timechart: Visualizes system behaviors based on the workload.&lt;/li&gt;
&lt;li&gt;c2c: Detects the potential for false sharing. Red Hat once tested the c2c prototype on a number of Linux applications and found many cases of false sharing and hot cache lines.&lt;/li&gt;
&lt;li&gt;trace: Traces system calls with acceptable overhead. A workload generated by the &lt;code&gt;dd&lt;/code&gt; command runs only &lt;strong&gt;1.36&lt;/strong&gt; times slower when traced.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's look at some common uses of perf.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;To see which commands made the most system calls:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;perf top &lt;span class="nt"&gt;-F&lt;/span&gt; 49 &lt;span class="nt"&gt;-e&lt;/span&gt; raw_syscalls:sys_enter &lt;span class="nt"&gt;--sort&lt;/span&gt; &lt;span class="nb"&gt;comm&lt;/span&gt;,dso &lt;span class="nt"&gt;--show-nr-samples&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsystem-call-counts.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsystem-call-counts.jpg" alt="System call counts"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;From the output, you can see that the &lt;code&gt;kube-apiserver&lt;/code&gt; command had the most system calls during sampling.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To see system calls that have latencies longer than a specific duration. In the following example, this duration is 200 milliseconds:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;perf trace &lt;span class="nt"&gt;--duration&lt;/span&gt; 200
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsystem-calls-longer-than-200-ms.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsystem-calls-longer-than-200-ms.jpg" alt="System calls longer than 200 ms"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;From the output, you can see the process names, process IDs (PIDs), the specific system calls that exceed 200 ms, and the returned values.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To see the processes that had system calls within a period of time and a summary of their overhead:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;perf trace &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;$PID&lt;/span&gt;  &lt;span class="nt"&gt;-s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsystem-call-overheads-by-process.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsystem-call-overheads-by-process.jpg" alt="System call overheads by process"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;From the output, you can see the number of times each system call was made, the number of errors, the total latency, the average latency, and so on.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To analyze the stack information of calls that have a high latency:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;perf trace record &lt;span class="nt"&gt;--call-graph&lt;/span&gt; dwarf &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;$PID&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;sleep &lt;/span&gt;10
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fstack-information-of-system-calls-with-high-latency.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fstack-information-of-system-calls-with-high-latency.jpg" alt="Stack information of system calls with high latency"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To trace a group of tasks. For example, if two BPF tools are running in the background and you want to see their system call information, you can add them to a &lt;code&gt;perf_event&lt;/code&gt; cgroup and then execute &lt;code&gt;perf trace&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; /sys/fs/cgroup/perf_event/bpftools/
&lt;span class="nb"&gt;echo &lt;/span&gt;22542 &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /sys/fs/cgroup/perf_event/bpftools/tasks
&lt;span class="nb"&gt;echo &lt;/span&gt;20514 &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /sys/fs/cgroup/perf_event/bpftools/tasks
perf trace &lt;span class="nt"&gt;-G&lt;/span&gt; bpftools &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;sleep &lt;/span&gt;10
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftrace-a-group-of-tasks.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftrace-a-group-of-tasks.jpg" alt="Trace a group of tasks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are some of the most common uses of perf. If you'd like to know more (especially about perf-trace), see the &lt;a href="https://man7.org/linux/man-pages/man1/perf-trace.1.html" rel="noopener noreferrer"&gt;Linux manual page&lt;/a&gt;. From the manual pages, you will learn that perf-trace can filter tasks based on PIDs or thread IDs (TIDs), but that it has no convenient support for containers or Kubernetes (K8s) environments. Don't worry. Next, we'll discuss a tool that can easily trace system calls in containers and K8s environments that use cgroup v2.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traceloop, a performance profiler for cgroup v2 and K8s
&lt;/h2&gt;

&lt;p&gt;Traceloop provides better support for tracing Linux system calls in containers or K8s environments that use cgroup v2. You might be unfamiliar with traceloop but know the BPF Compiler Collection (BCC) pretty well. (Its front end is implemented in Python and C++.) In the IO Visor Project, BCC's parent project, there is another project named gobpf that provides Golang bindings for the BCC framework. traceloop is built on top of gobpf for container and K8s environments. The following illustration shows the traceloop architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftraceloop-architecture.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftraceloop-architecture.jpg" alt="traceloop architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;traceloop architecture &lt;/p&gt;

&lt;p&gt;We can further simplify this illustration into the following key procedures. Note that these procedures are implementation details, not operations to perform:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;bpf helper&lt;/code&gt; gets the cgroup ID. Tasks are filtered based on the cgroup ID rather than on the PID and TID.&lt;/li&gt;
&lt;li&gt;Each cgroup ID corresponds to a &lt;a href="https://ebpf.io/what-is-ebpf/#tail--function-calls" rel="noopener noreferrer"&gt;bpf tail call&lt;/a&gt; that can call and execute another eBPF program and replace the execution context. Syscall events are written through a bpf tail call to a perf ring buffer with the same cgroup ID.&lt;/li&gt;
&lt;li&gt;The user space reads the perf ring buffer based on this cgroup ID.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Currently, you can get the cgroup ID only through the bpf helper &lt;code&gt;bpf_get_current_cgroup_id&lt;/code&gt;, and this ID is available only in cgroup v2. Therefore, before you use traceloop, make sure that &lt;a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#basic-operations" rel="noopener noreferrer"&gt;cgroup v2 is enabled&lt;/a&gt; in your environment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the following demo (on the CentOS 8 4.18 kernel), traceloop dumps the traced system call information when it exits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; ./traceloop cgroups &lt;span class="nt"&gt;--dump-on-exit&lt;/span&gt; /sys/fs/cgroup/system.slice/sshd.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftraceloop-tracing-system-calls.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftraceloop-tracing-system-calls.jpg" alt="traceloop tracing system calls"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;traceloop tracing system calls &lt;/p&gt;

&lt;p&gt;As the results show, the traceloop output is similar to that of strace or perf-trace except for the cgroup-based task filtering. Note that CentOS 8 mounts cgroup v2 directly on the &lt;code&gt;/sys/fs/cgroup&lt;/code&gt; path instead of on &lt;code&gt;/sys/fs/cgroup/unified&lt;/code&gt; as Ubuntu does. Therefore, before you use traceloop, you should run &lt;code&gt;mount -t cgroup2&lt;/code&gt; to determine the mount information.&lt;/p&gt;

&lt;p&gt;The team behind traceloop has integrated it with the Inspektor Gadget project, so you can run traceloop on the K8s platform using kubectl. See the demos in &lt;a href="https://github.com/kinvolk/inspektor-gadget#how-to-use" rel="noopener noreferrer"&gt;Inspektor Gadget - How to use&lt;/a&gt; and, if you like, try it on your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark with system calls traced
&lt;/h2&gt;

&lt;p&gt;We conducted a sysbench test in which system calls were either traced using multiple tracers (traceloop, strace, and perf-trace) or not traced. The benchmark results are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsysbench-results-with-system-calls-traced-and-untraced.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fsysbench-results-with-system-calls-traced-and-untraced.jpg" alt="Sysbench results with system calls traced and untraced"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sysbench results with system calls traced and untraced &lt;/p&gt;

&lt;p&gt;As the benchmark shows, strace caused the biggest decrease in application performance. perf-trace caused a smaller decrease, and traceloop caused the smallest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary of Linux profilers
&lt;/h2&gt;

&lt;p&gt;For issues such as "Why can't the software run on this machine," strace is still a powerful system call tracer in Linux. But to trace the latency of system calls, the BPF-based perf-trace is a better option. In containers or K8s environments that use cgroup v2, traceloop is the easiest to use.&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://en.pingcap.com/blog/how-to-trace-linux-system-calls-in-production-with-minimal-impact-on-performance" rel="noopener noreferrer"&gt;How to Trace Linux System Calls in Production with Minimal Impact on Performance&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Tips and Tricks for Writing Linux BPF Applications with libbpf</title>
      <dc:creator>Wenbo Zhang</dc:creator>
      <pubDate>Tue, 12 Jan 2021 15:04:08 +0000</pubDate>
      <link>https://dev.to/ethercflow/tips-and-tricks-for-writing-linux-bpf-applications-with-libbpf-24p1</link>
      <guid>https://dev.to/ethercflow/tips-and-tricks-for-writing-linux-bpf-applications-with-libbpf-24p1</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Flinux-bpf-performance-analysis-tools.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Flinux-bpf-performance-analysis-tools.jpg" alt="Linux BPF performance analysis tools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the beginning of 2020, when I was using the BCC tools to analyze our database performance bottlenecks and pulled the latest code from GitHub, I accidentally discovered an additional &lt;a href="https://github.com/iovisor/bcc/tree/master/libbpf-tools" rel="noopener noreferrer"&gt;libbpf-tools&lt;/a&gt; directory in the BCC project. I had read an article on &lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html" rel="noopener noreferrer"&gt;BPF portability&lt;/a&gt; and another on &lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/20/bcc-to-libbpf-howto-guide.html" rel="noopener noreferrer"&gt;BCC to libbpf conversion&lt;/a&gt;, and I used what I learned to convert my previously submitted bcc-tools to libbpf-tools. I ended up converting nearly 20 tools. (See &lt;a href="https://pingcap.com/blog/why-we-switched-from-bcc-tools-to-libbpf-tools-for-bpf-performance-analysis" rel="noopener noreferrer"&gt;Why We Switched from bcc-tools to libbpf-tools for BPF Performance Analysis&lt;/a&gt;.) &lt;/p&gt;

&lt;p&gt;During this process, I was fortunate to get a lot of help from &lt;a href="https://github.com/anakryiko" rel="noopener noreferrer"&gt;Andrii Nakryiko&lt;/a&gt; (the libbpf + BPF CO-RE project's leader). It was fun and I learned a lot. In this post, I'll share my experience about writing Berkeley Packet Filter (BPF) applications with libbpf. I hope this article is helpful to people who are interested in libbpf and inspires them to further develop and improve BPF applications with libbpf. &lt;/p&gt;

&lt;p&gt;Before you read further, however, consider reading &lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html" rel="noopener noreferrer"&gt;these posts for important background information:&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html" rel="noopener noreferrer"&gt;BPF Portability and CO-RE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/20/bcc-to-libbpf-howto-guide.html" rel="noopener noreferrer"&gt;HOWTO: BCC to libbpf conversion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nakryiko.com/posts/libbpf-bootstrap/" rel="noopener noreferrer"&gt;Building BPF applications with libbpf-boostrap&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article assumes that you've already read these posts, so there won't be any systematic descriptions. Instead, I'll offer you some tips for certain parts of the program.&lt;/p&gt;

&lt;h2&gt;
  
  
  Program skeleton
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Combining the open and load phases
&lt;/h3&gt;

&lt;p&gt;If your BPF code doesn't need any runtime adjustments (for example, adjusting the map size or setting an extra configuration), you can call &lt;code&gt;&amp;lt;name&amp;gt;__open_and_load()&lt;/code&gt; to combine the two phases into one. This makes our code look more compact. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readahead_bpf__open_and_load&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"failed to open and/or load BPF object&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readahead_bpf__attach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/readahead.c#L75" rel="noopener noreferrer"&gt;readahead.c&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Selective attach
&lt;/h3&gt;

&lt;p&gt;By default, &lt;code&gt;&amp;lt;name&amp;gt;__attach()&lt;/code&gt; attaches all auto-attachable BPF programs. However, sometimes you might want to selectively attach the corresponding BPF program according to the command line parameters. In this case, you can call &lt;code&gt;bpf_program__attach()&lt;/code&gt; instead. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;biolatency_bpf__load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queued&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_rq_insert&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
                &lt;span class="n"&gt;bpf_program__attach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;progs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_rq_insert&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;libbpf_get_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_rq_insert&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_rq_issue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="n"&gt;bpf_program__attach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;progs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_rq_issue&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;libbpf_get_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_rq_issue&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/biolatency.c#L264" rel="noopener noreferrer"&gt;biolatency.c&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Custom load and attach
&lt;/h3&gt;

&lt;p&gt;Skeleton is suitable for almost all scenarios, but there is a special case: perf events. In this case, instead of using links from &lt;code&gt;struct &amp;lt;name&amp;gt;__bpf&lt;/code&gt;, you need to define an array: &lt;code&gt;struct bpf_link *links[]&lt;/code&gt;. The reason is that &lt;code&gt;perf_event&lt;/code&gt; needs to be opened separately on each CPU. &lt;/p&gt;

&lt;p&gt;After this, open and attach &lt;code&gt;perf_event&lt;/code&gt; by yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;open_and_attach_perf_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;bpf_program&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;prog&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;bpf_link&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;perf_event_attr&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PERF_TYPE_SOFTWARE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_period&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PERF_COUNT_SW_CPU_CLOCK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;nr_cpus&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;fd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__NR_perf_event_open&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"failed to init perf sampling: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;strerror&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errno&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
                        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_program__attach_perf_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prog&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;libbpf_get_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"failed to attach perf event on cpu: "&lt;/span&gt;
                                &lt;span class="s"&gt;"%d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                        &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                        &lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; 

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, during the tear-down phase, remember to destroy each link in the &lt;code&gt;links&lt;/code&gt; array and then free &lt;code&gt;links&lt;/code&gt; itself.&lt;/p&gt;
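
&lt;p&gt;The tear-down can be sketched like this (a minimal sketch, assuming the &lt;code&gt;links&lt;/code&gt; array and &lt;code&gt;nr_cpus&lt;/code&gt; from the setup code above; it needs libbpf to compile, and &lt;code&gt;bpf_link__destroy()&lt;/code&gt; tolerates &lt;code&gt;NULL&lt;/code&gt; entries that never attached):&lt;/p&gt;

```c
/* Tear-down sketch: destroy every per-CPU link first, then free the
 * array itself. Assumes `links` and `nr_cpus` come from the setup code
 * shown above; bpf_link__destroy() accepts NULL, so links that failed
 * to attach are safe to pass. */
#include <stdlib.h>
#include <bpf/libbpf.h>

static void free_links(struct bpf_link **links, int nr_cpus)
{
        int i;

        for (i = 0; i < nr_cpus; i++)
                bpf_link__destroy(links[i]);
        free(links);
}
```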

&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/runqlen.c" rel="noopener noreferrer"&gt;runqlen.c&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multiple handlers for the same event
&lt;/h3&gt;

&lt;p&gt;Starting in &lt;a href="https://github.com/libbpf/libbpf/releases/tag/v0.2" rel="noopener noreferrer"&gt;v0.2&lt;/a&gt;, libbpf supports multiple entry-point BPF programs within the same executable and linkable format (ELF) section. Therefore, you can attach multiple BPF programs to the same event (such as tracepoints or kprobes) without worrying about ELF section name clashes. For details, see &lt;a href="https://patchwork.ozlabs.org/project/netdev/cover/20200903203542.15944-1-andriin@fb.com/" rel="noopener noreferrer"&gt;Add libbpf full support for BPF-to-BPF calls&lt;/a&gt;. Now, you can naturally define multiple handlers for an event like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tp_btf/irq_handler_entry"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_PROG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;irq_handler_entry1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;irq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;irqaction&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tp_btf/irq_handler_entry"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_PROG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;irq_handler_entry2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/ethercflow/libbpf-bootstrap/blob/master/src/hardirqs.bpf.c" rel="noopener noreferrer"&gt;hardirqs.bpf.c&lt;/a&gt; (built with &lt;a href="https://github.com/libbpf/libbpf-bootstrap" rel="noopener noreferrer"&gt;libbpf-bootstrap&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;If your libbpf version is earlier than v0.2, to define multiple handlers for an event, you have to use multiple program types, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tracepoint/irq/irq_handler_entry"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;handle__irq_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;trace_event_raw_irq_handler_entry&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tp_btf/irq_handler_entry"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_PROG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;irq_handler_entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/hardirqs.bpf.c" rel="noopener noreferrer"&gt;hardirqs.bpf.c&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Reduce pre-allocation overhead
&lt;/h3&gt;

&lt;p&gt;Beginning in Linux 4.6, BPF hash maps pre-allocate memory by default, and the &lt;code&gt;BPF_F_NO_PREALLOC&lt;/code&gt; flag was introduced to opt out. The motivation was to avoid kprobe + bpf deadlocks. The community tried other solutions, but in the end, pre-allocating all the map elements was the simplest one and didn't affect the user-space visible behavior.&lt;/p&gt;

&lt;p&gt;When full map pre-allocation is too expensive in memory, define the map with the &lt;code&gt;BPF_F_NO_PREALLOC&lt;/code&gt; flag to restore the old on-demand behavior. For details, see &lt;a href="https://lore.kernel.org/patchwork/cover/656547/" rel="noopener noreferrer"&gt;bpf: map pre-alloc&lt;/a&gt;. When the map is small (such as &lt;code&gt;MAX_ENTRIES&lt;/code&gt; = 256), this flag is unnecessary, because &lt;code&gt;BPF_F_NO_PREALLOC&lt;/code&gt; makes map operations slower.&lt;/p&gt;

&lt;p&gt;Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_MAP_TYPE_HASH&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_entries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAX_ENTRIES&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;map_flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_F_NO_PREALLOC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="nf"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".maps"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see many cases in &lt;a href="https://github.com/iovisor/bcc/tree/master/libbpf-tools" rel="noopener noreferrer"&gt;libbpf-tools&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Determine the map size at runtime
&lt;/h3&gt;

&lt;p&gt;One advantage of libbpf-tools is portability: the same binary runs on different machines, so the maximum space a map requires may differ from machine to machine. In this case, you can define the map without specifying its size and resize it before the load phase. For example:&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;&amp;lt;name&amp;gt;.bpf.c&lt;/code&gt;, define the map as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_MAP_TYPE_HASH&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="nf"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".maps"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the open phase, call &lt;code&gt;bpf_map__resize()&lt;/code&gt;. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;cpudist_bpf&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cpudist_bpf__open&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;bpf_map__resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;maps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pid_max&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/cpudist.c#L223" rel="noopener noreferrer"&gt;cpudist.c&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-CPU
&lt;/h3&gt;

&lt;p&gt;When you select the map type, if multiple associated events occur on the same CPU, using a per-CPU array to track the timestamp is much simpler and more efficient than using a hashmap. However, you must be sure that the kernel doesn't migrate the process from one CPU to another between the two BPF program invocations, so you can't always use this trick. The following example analyzes soft interrupts and meets both conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_MAP_TYPE_PERCPU_ARRAY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_entries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="nf"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".maps"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tp_btf/softirq_entry"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_PROG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;softirq_entry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;vec_nr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;u64&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_ktime_get_ns&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 

        &lt;span class="n"&gt;bpf_map_update_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tp_btf/softirq_exit"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_PROG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;softirq_exit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;vec_nr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;u64&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tsp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="p"&gt;[...]&lt;/span&gt;
        &lt;span class="n"&gt;tsp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_map_lookup_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/softirqs.bpf.c" rel="noopener noreferrer"&gt;softirqs.bpf.c&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Global variables
&lt;/h2&gt;

&lt;p&gt;Not only can you use global variables to customize BPF program logic, but you can also use them instead of maps to make your program simpler and more efficient. Global variables can be of any type; you just need to declare them with a fixed size (or at least a bounded maximum size, if you don't mind wasting some memory).&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;Because the number of SOFTIRQ types is fixed, you can define global arrays to save counts and histograms in &lt;code&gt;softirq.bpf.c&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;__u64&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;NR_SOFTIRQS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt; &lt;span class="n"&gt;hists&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;NR_SOFTIRQS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, you can traverse the array directly in user space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;print_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;softirqs_bpf__bss&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nanoseconds&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s"&gt;"nsecs"&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"usecs"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;__u64&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;__u32&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%-16s %6s%5s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"SOFTIRQ"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"TOTAL_"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;NR_SOFTIRQS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__atomic_exchange_n&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;bss&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                        &lt;span class="n"&gt;__ATOMIC_RELAXED&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%-16s %11llu&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vec_names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the complete code in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/softirqs.c" rel="noopener noreferrer"&gt;softirqs.c&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch out for directly accessing fields through pointers
&lt;/h2&gt;

&lt;p&gt;As you know from the &lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#reading-kernel-structures-fields" rel="noopener noreferrer"&gt;BPF Portability and CO-RE&lt;/a&gt; blog post, the libbpf + &lt;code&gt;BPF_PROG_TYPE_TRACING&lt;/code&gt; approach benefits from the smarts of the BPF verifier: it understands and tracks BTF natively and lets you follow pointers and read kernel memory directly (and safely). For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;u64&lt;/span&gt; &lt;span class="n"&gt;inode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;mm&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;exe_file&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;f_inode&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;i_ino&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is very cool. However, when you use such expressions in conditional statements, a verifier bug in some kernel versions can incorrectly eliminate the branch. Until &lt;a href="https://www.spinics.net/lists/bpf/msg21897.html" rel="noopener noreferrer"&gt;bpf: fix an incorrect branch elimination by verifier&lt;/a&gt; is widely backported, please use &lt;code&gt;BPF_CORE_READ&lt;/code&gt; for kernel compatibility. You can find an example in &lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/biolatency.bpf.c#L63" rel="noopener noreferrer"&gt;biolatency.bpf.c&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tp_btf/block_rq_issue"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_PROG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block_rq_issue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;request_queue&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;targ_queued&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;BPF_CORE_READ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elevator&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;trace_rq_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rq&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, even though this is a &lt;code&gt;tp_btf&lt;/code&gt; program in which a direct &lt;code&gt;q-&amp;gt;elevator&lt;/code&gt; access would be faster, I have to use &lt;code&gt;BPF_CORE_READ(q, elevator)&lt;/code&gt; instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article introduced some tips for writing BPF programs with libbpf. You can find many practical examples in &lt;a href="https://github.com/iovisor/bcc/tree/master/libbpf-tools" rel="noopener noreferrer"&gt;libbpf-tools&lt;/a&gt; and &lt;a href="https://github.com/torvalds/linux/tree/master/tools/testing/selftests/bpf" rel="noopener noreferrer"&gt;the kernel's BPF selftests&lt;/a&gt;. If you have any questions, you can join the &lt;a href="https://slack.tidb.io/invite?team=tidb-community&amp;amp;channel=everyone&amp;amp;ref=pingcap-blog" rel="noopener noreferrer"&gt;TiDB community on Slack&lt;/a&gt; and send us your feedback.&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://en.pingcap.com/blog/tips-and-tricks-for-writing-linux-bpf-applications-with-libbpf/" rel="noopener noreferrer"&gt;PingCAP&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>linux</category>
    </item>
    <item>
      <title>Why We Disable Linux's THP Feature for Databases</title>
      <dc:creator>Wenbo Zhang</dc:creator>
      <pubDate>Thu, 07 Jan 2021 14:51:43 +0000</pubDate>
      <link>https://dev.to/ethercflow/why-we-disable-linux-s-thp-feature-for-databases-57</link>
      <guid>https://dev.to/ethercflow/why-we-disable-linux-s-thp-feature-for-databases-57</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fhow-thp-slows-down-your-database-performance-banner.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fhow-thp-slows-down-your-database-performance-banner.jpg" alt="Disabling THP to improve database performance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Linux's memory management system is transparent to the user. However, if you're not familiar with its working principles, you might run into unexpected performance issues. That's especially true for sophisticated software like databases. When databases run in Linux, even small system variations might impact performance.&lt;/p&gt;

&lt;p&gt;After an in-depth investigation, we found that &lt;a href="https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html" rel="noopener noreferrer"&gt;Transparent Huge Page&lt;/a&gt; (THP), a Linux memory management feature, often slows down database performance. In this post, I'll describe how THP causes performance to fluctuate, the typical symptoms, and our recommended solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is THP
&lt;/h2&gt;

&lt;p&gt;THP is an important feature of the Linux kernel. It maps page table entries to larger page sizes to reduce page faults. This improves the &lt;a href="https://en.wikipedia.org/wiki/Translation_lookaside_buffer" rel="noopener noreferrer"&gt;translation lookaside buffer&lt;/a&gt; (TLB) hit ratio. TLB is a memory cache used by the memory management unit to improve the translation speed from virtual memory addresses to physical memory addresses.&lt;/p&gt;

&lt;p&gt;When the application data being accessed is contiguous, THP often boosts performance. In contrast, if the memory access patterns are not contiguous, THP can't fulfill its duty, and it may even cause system instability.&lt;/p&gt;

&lt;p&gt;Unfortunately, database workloads are known to have sparse rather than contiguous memory access. Therefore, you should disable THP for your database.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Linux manages its memory
&lt;/h3&gt;

&lt;p&gt;To understand the harm THP can cause, let's consider how Linux manages its physical memory.&lt;/p&gt;

&lt;p&gt;The Linux kernel employs different memory-mapping approaches for different purposes: user space maps memory via multi-level page tables to save space, while kernel space uses a linear mapping for simplicity and efficiency.&lt;/p&gt;

&lt;p&gt;When the kernel starts, it adds physical pages to &lt;a href="https://en.wikipedia.org/wiki/Buddy_memory_allocation" rel="noopener noreferrer"&gt;the buddy system&lt;/a&gt;. Each time a process requests memory, the buddy system allocates the desired pages; when the process releases memory, the buddy system takes the pages back.&lt;/p&gt;

&lt;p&gt;To accommodate low-speed devices and varied workloads, Linux divides memory pages into anonymous pages and file-backed pages, and uses the page cache to cache files from low-speed devices. When memory is insufficient, the swap cache and the &lt;code&gt;swappiness&lt;/code&gt; parameter determine what proportion of the two page types is released.&lt;/p&gt;
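
&lt;p&gt;As a quick sketch, the reclaim preference between the two page types can be inspected via the &lt;code&gt;swappiness&lt;/code&gt; knob:&lt;/p&gt;

```shell
# swappiness biases reclaim between file-backed pages (page cache) and
# anonymous pages: lower values favor keeping anonymous pages in RAM
cat /proc/sys/vm/swappiness
```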

&lt;p&gt;To respond to memory requests as quickly as possible, and to keep the system running when memory is scarce, Linux defines three watermarks: &lt;code&gt;high&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, and &lt;code&gt;min&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the unused physical memory falls below &lt;code&gt;low&lt;/code&gt; but stays above &lt;code&gt;min&lt;/code&gt;, then when a user requests memory, the page replacement daemon &lt;code&gt;kswapd&lt;/code&gt; asynchronously frees memory until the available physical memory rises above &lt;code&gt;high&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If asynchronous reclaim can't keep up with the memory requests, Linux triggers synchronous direct reclaim. In that case, every thread requesting memory takes part in freeing it, and each thread obtains the memory it requested only once enough becomes available.&lt;/li&gt;
&lt;/ul&gt;
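
&lt;p&gt;The watermarks above are maintained per memory zone and can be read from &lt;code&gt;/proc/zoneinfo&lt;/code&gt;; a minimal sketch:&lt;/p&gt;

```shell
# Print the min/low/high watermarks (in pages) for each memory zone;
# kswapd wakes below "low" and direct reclaim kicks in below "min"
awk '/^Node/ {zone = $0}
     $1 == "min" || $1 == "low" || $1 == "high" {print zone ": " $1 " = " $2 " pages"}' /proc/zoneinfo
```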

&lt;p&gt;During direct reclaim, if the pages are clean, the blockage caused by synchronous reclaim is short; if they are dirty, it can add tens of milliseconds of latency and, depending on the backing devices, sometimes even seconds.&lt;/p&gt;

&lt;p&gt;Apart from the watermarks, another mechanism may also cause direct memory reclaim. Sometimes a thread requests a large block of contiguous memory pages. If there is enough physical memory but it's fragmented, the kernel performs memory compaction, which might in turn trigger direct memory reclaim.&lt;/p&gt;

&lt;p&gt;To sum up, when threads apply for memory, the major causes of latency are direct memory reclaim and memory compaction. For workloads whose memory access is not very contiguous, such as databases, THP may trigger the two tasks and thus cause fluctuating performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  When THP causes performance fluctuation
&lt;/h2&gt;

&lt;p&gt;If your system performance fluctuates, how can you be sure THP is the cause? I'd like to share three symptoms that we've found are related to THP.&lt;/p&gt;

&lt;h3&gt;
  
  
  The most typical symptom: &lt;code&gt;sys cpu&lt;/code&gt; rises
&lt;/h3&gt;

&lt;p&gt;Based on our customer support experience, the most typical symptom of THP-caused performance fluctuation is sharply rising system CPU utilization.&lt;/p&gt;

&lt;p&gt;In such cases, if you create an on-CPU flame graph using &lt;a href="https://en.wikipedia.org/wiki/Perf_(Linux)" rel="noopener noreferrer"&gt;perf&lt;/a&gt;, you'll see that all the service threads in the runnable state are performing memory compaction, and that the page fault exception handler is &lt;code&gt;do_huge_pmd_anonymous_page&lt;/code&gt;. This means the system doesn't have any contiguous 2 MB blocks of physical memory available, which triggers direct memory compaction. Direct compaction is time-consuming, so it drives up system CPU utilization.&lt;/p&gt;

&lt;h3&gt;
  
  
  The indirect symptom: &lt;code&gt;sys load&lt;/code&gt; rises
&lt;/h3&gt;

&lt;p&gt;Many memory issues are not as obvious as those described above. When the system allocates THP or other high-order memory, it doesn't always perform memory compaction directly and leave you an obvious trace. Instead, it often mixes compaction with other tasks, such as direct memory reclaim.&lt;/p&gt;

&lt;p&gt;Involving direct reclaim makes troubleshooting more confusing. For example, even when the unused physical memory in the normal zone is higher than the &lt;code&gt;high&lt;/code&gt; watermark, the system may still continuously reclaim memory. To get to the bottom of this, we need to dive deeper into the processing logic of the slow memory allocation path.&lt;/p&gt;

&lt;p&gt;The slow memory allocation breaks down into four major steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Asynchronous memory compaction&lt;/li&gt;
&lt;li&gt;Direct memory reclaim&lt;/li&gt;
&lt;li&gt;Direct memory compaction&lt;/li&gt;
&lt;li&gt;Out-of-memory (OOM) killing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After each step, the system tries to allocate memory. If the allocation succeeds, the system returns the allocated page and skips the remaining steps. For each allocation, the kernel provides a fragmentation index for each order in the buddy system, which indicates whether the allocation failure is caused by insufficient memory or by fragmented memory.&lt;/p&gt;

&lt;p&gt;The fragmentation index is associated with the &lt;code&gt;/proc/sys/vm/extfrag_threshold&lt;/code&gt; parameter. The closer the number is to 1,000, the more the allocation failure is related to memory fragmentation, and the kernel is more likely to perform memory compaction. The closer the number is to 0, the more the allocation failure is related to insufficient memory, and the kernel is more inclined to perform memory reclaim.&lt;/p&gt;
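
&lt;p&gt;You can read (and, with root privileges, adjust) this threshold directly:&lt;/p&gt;

```shell
# The compaction-vs-reclaim tradeoff knob; the default is 500
cat /proc/sys/vm/extfrag_threshold
```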

&lt;p&gt;Therefore, even when the unused memory is higher than the &lt;code&gt;high&lt;/code&gt; watermark, the system may also frequently reclaim memory. Because THP consumes high-level memory, it compounds the performance fluctuation caused by memory fragmentation.&lt;/p&gt;

&lt;p&gt;To verify whether the performance fluctuation is related to memory fragmentation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;View the direct memory reclaim operations taken per second. Execute &lt;code&gt;sar -b&lt;/code&gt; to observe &lt;code&gt;pgscand/s&lt;/code&gt;. If this number is greater than 0 for a consecutive period of time, take the following steps to troubleshoot the problem.&lt;/li&gt;
&lt;li&gt;Observe the memory fragmentation index. Execute &lt;code&gt;cat /sys/kernel/debug/extfrag/extfrag_index&lt;/code&gt; to get the index. Focus on the fragmentation index of the blocks whose order is &amp;gt;= 3. If the number is close to 1,000, the fragmentation is severe; if it's close to 0, the memory is insufficient.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;View the memory fragmentation status. Execute &lt;code&gt;cat /proc/buddyinfo&lt;/code&gt; and &lt;code&gt;cat /proc/pagetypeinfo&lt;/code&gt; to show the status. (Refer to the &lt;a href="https://man7.org/linux/man-pages/man5/proc.5.html" rel="noopener noreferrer"&gt;Linux manual page&lt;/a&gt; for details.) Focus on the number of pages whose order is &amp;gt;= 3.&lt;/p&gt;

&lt;p&gt;Compared to &lt;code&gt;buddyinfo&lt;/code&gt;, &lt;code&gt;pagetypeinfo&lt;/code&gt; displays more detailed information grouped by migration types. The buddy system implements anti-fragmentation through migration types. Note that if all the &lt;code&gt;Unmovable&lt;/code&gt; pages are grouped in order &amp;lt; 3, the kernel slab objects have severe fragmentation. In such cases, you need to troubleshoot the specific cause of the problem using other tools.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For kernels that support the &lt;a href="https://en.wikipedia.org/wiki/Berkeley_Packet_Filter" rel="noopener noreferrer"&gt;Berkeley Packet Filter&lt;/a&gt; (BPF), such as CentOS 7.6, &lt;strong&gt;you may also perform quantitative analysis on the latency&lt;/strong&gt; using &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/drsnoop_example.txt" rel="noopener noreferrer"&gt;drsnoop&lt;/a&gt; or &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/compactsnoop_example.txt" rel="noopener noreferrer"&gt;compactsnoop&lt;/a&gt; developed by PingCAP.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;(Optional) &lt;strong&gt;Trace the &lt;code&gt;mm_page_alloc_extfrag&lt;/code&gt; event with &lt;a href="https://en.wikipedia.org/wiki/Ftrace" rel="noopener noreferrer"&gt;ftrace&lt;/a&gt;&lt;/strong&gt;. Due to memory fragmentation, the migration type steals physical pages from the backup migration type.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
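
&lt;p&gt;If &lt;code&gt;sar&lt;/code&gt; isn't available, step 1 can be approximated by sampling the direct-reclaim scan counters in &lt;code&gt;/proc/vmstat&lt;/code&gt; (counter names vary slightly across kernel versions, so this matches on the prefix):&lt;/p&gt;

```shell
# Sample the direct-reclaim page-scan counters twice, one second apart;
# a persistently nonzero delta indicates ongoing direct reclaim
s1=$(awk '$1 ~ /^pgscan_direct/ {sum += $2} END {print sum + 0}' /proc/vmstat)
sleep 1
s2=$(awk '$1 ~ /^pgscan_direct/ {sum += $2} END {print sum + 0}' /proc/vmstat)
echo "pages scanned by direct reclaim in the last second: $((s2 - s1))"
```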

&lt;h3&gt;
  
  
  The atypical symptom: abnormal RES usage
&lt;/h3&gt;

&lt;p&gt;Sometimes, when a service starts on an AArch64 server, it occupies dozens of gigabytes of physical memory. Viewing the &lt;code&gt;/proc/pid/smaps&lt;/code&gt; file shows that most of that memory is used for THP. Because the CentOS 7 kernel on AArch64 uses a 64 KB page size, resident memory usage is many times larger than on the x86_64 platform.&lt;/p&gt;
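
&lt;p&gt;To quantify how much of a process's resident memory is THP-backed, you can sum the &lt;code&gt;AnonHugePages&lt;/code&gt; fields in its smaps file (shown here for the current shell; substitute the service's PID for &lt;code&gt;self&lt;/code&gt;):&lt;/p&gt;

```shell
# Sum THP-backed anonymous memory across all mappings of a process
awk '/AnonHugePages/ {sum += $2} END {print sum + 0, "kB of RSS backed by THP"}' /proc/self/smaps
```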

&lt;h2&gt;
  
  
  How to deal with THP
&lt;/h2&gt;

&lt;p&gt;For applications that are not optimized to store their data contiguously, or that have sparse memory access patterns, enabling THP and THP defrag is detrimental to long-running services.&lt;/p&gt;

&lt;p&gt;Before Linux v4.6, the kernel didn't provide the &lt;code&gt;defer&lt;/code&gt; or &lt;code&gt;defer + madvise&lt;/code&gt; options for THP defrag. Therefore, for CentOS 7, which uses the v3.10 kernel, we recommend disabling THP. If your applications do need THP, however, set THP to &lt;code&gt;madvise&lt;/code&gt;, which allocates huge pages only via the &lt;a href="https://www.man7.org/linux/man-pages/man2/madvise.2.html" rel="noopener noreferrer"&gt;madvise system call&lt;/a&gt;. Otherwise, setting THP to &lt;code&gt;never&lt;/code&gt; is the best choice for your application.&lt;/p&gt;

&lt;p&gt;To disable THP:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;View the current THP configuration:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/kernel/mm/transparent_hugepage/enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the value is &lt;code&gt;always&lt;/code&gt;, execute the following commands:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;never &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/kernel/mm/transparent_hugepage/enabled
&lt;span class="nb"&gt;echo &lt;/span&gt;never &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/kernel/mm/transparent_hugepage/defrag
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note that if you restart the server, THP might be turned on again. You can write the two commands in the &lt;code&gt;.service&lt;/code&gt; file and let &lt;a href="https://en.wikipedia.org/wiki/Systemd" rel="noopener noreferrer"&gt;systemd&lt;/a&gt; manage it for you.&lt;/p&gt;
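
&lt;p&gt;A sketch of such a unit file (the unit name and install target here are illustrative, not a standard):&lt;/p&gt;

```ini
# /etc/systemd/system/disable-thp.service (illustrative name)
[Unit]
Description=Disable Transparent Huge Pages
After=sysinit.target local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=basic.target
```

&lt;p&gt;After installing the file, run &lt;code&gt;systemctl daemon-reload&lt;/code&gt; and &lt;code&gt;systemctl enable disable-thp&lt;/code&gt; so the setting survives reboots.&lt;/p&gt;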

&lt;h2&gt;
  
  
  Join our community
&lt;/h2&gt;

&lt;p&gt;If you have any other questions about database performance tuning, or would like to share your expertise, feel free to join the &lt;a href="https://slack.tidb.io/invite?team=tidb-community&amp;amp;channel=everyone&amp;amp;ref=pingcap-blog" rel="noopener noreferrer"&gt;TiDB Community Slack&lt;/a&gt; workspace.&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://en.pingcap.com/blog/why-we-disable-linux-thp-feature-for-databases" rel="noopener noreferrer"&gt;pingcap.com&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why We Switched from bcc-tools to libbpf-tools for BPF Performance Analysis</title>
      <dc:creator>Wenbo Zhang</dc:creator>
      <pubDate>Mon, 14 Dec 2020 08:00:10 +0000</pubDate>
      <link>https://dev.to/ethercflow/why-we-switched-from-bcc-tools-to-libbpf-tools-for-bpf-performance-analysis-3f09</link>
      <guid>https://dev.to/ethercflow/why-we-switched-from-bcc-tools-to-libbpf-tools-for-bpf-performance-analysis-3f09</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oeOAEmw3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/bcc-vs-libbpf-bpf-performance-analysis.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oeOAEmw3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/bcc-vs-libbpf-bpf-performance-analysis.jpg" alt="BPF Linux, BPF performance tools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Distributed clusters might encounter performance problems or unpredictable failures, especially when they are running in the cloud. Of all the kinds of failures, kernel failures may be the most difficult to analyze and simulate. &lt;/p&gt;

&lt;p&gt;A practical solution is &lt;a href="https://en.wikipedia.org/wiki/Berkeley_Packet_Filter"&gt;Berkeley Packet Filter&lt;/a&gt; (BPF), a highly flexible, efficient virtual machine that runs in the Linux kernel. It allows bytecode to be safely executed in various hooks, which exist in a variety of Linux kernel subsystems. BPF is mainly used for networking, tracing, and security.&lt;/p&gt;

&lt;p&gt;Based on BPF, there are two development modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://github.com/iovisor/bcc"&gt;BPF Compiler Collection&lt;/a&gt; (BCC) toolkit offers many useful resources and examples to construct effective kernel tracing and manipulation programs. However, it has disadvantages. &lt;/li&gt;
&lt;li&gt;libbpf + BPF CO-RE (Compile Once – Run Everywhere) is a different development and deployment mode than the BCC framework. &lt;strong&gt;It greatly reduces storage space and runtime overhead, which enables BPF to support more hardware environments, and it optimizes programmers' development experience.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, I'll describe why libbpf-tools, a collection of applications based on the libbpf + BPF CO-RE mode, is a better solution than bcc-tools and how we're using libbpf-tools at &lt;a href="https://pingcap.com/"&gt;PingCAP&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why libbpf + BPF CO-RE is better than BCC
&lt;/h2&gt;

&lt;h3&gt;
  
  
  BCC vs. libbpf + BPF CO-RE
&lt;/h3&gt;

&lt;p&gt;BCC embeds LLVM or Clang to rewrite, compile, and load BPF programs. Although it does its best to simplify BPF developers' work, it has these drawbacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It uses the Clang front-end to modify user-written BPF programs. When a problem occurs, it's difficult to find the problem and figure out a solution. &lt;/li&gt;
&lt;li&gt;You must remember naming conventions and automatically generated tracepoint structs. &lt;/li&gt;
&lt;li&gt;Because the libbcc library contains a huge LLVM or Clang library, when you use it, you might encounter some issues:

&lt;ul&gt;
&lt;li&gt;When a tool starts, it takes many CPU and memory resources to compile the BPF program. If it runs on a server that lacks system resources, it might trigger a problem.&lt;/li&gt;
&lt;li&gt;BCC depends on kernel header packages, which you must install on each target host. If you need unexported content in the kernel, you must manually copy and paste the type definition into the BPF code.&lt;/li&gt;
&lt;li&gt;Because BPF programs are compiled during runtime, many simple compilation errors can only be detected at runtime. This affects your development experience.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By contrast, BPF CO-RE has these advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When you implement BPF CO-RE, you can directly use the libbpf library provided by kernel developers to develop BPF programs. The development method is the same as writing ordinary C user-mode programs: one compilation generates a small binary file. &lt;/li&gt;
&lt;li&gt;Libbpf acts like a BPF program loader and relocates, loads, and checks BPF programs. BPF developers only need to focus on the BPF programs' correctness and performance. &lt;/li&gt;
&lt;li&gt;This approach minimizes overhead and removes huge dependencies, which makes the overall development process smoother.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more details, see &lt;a href="https://facebookmicrosites.github.io/bpf/blog/2020/02/20/bcc-to-libbpf-howto-guide.html#why-libbpf-and-bpf-co-re"&gt;Why libbpf and BPF CO-RE?&lt;/a&gt;.&lt;/p&gt;
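
&lt;p&gt;One visible consequence of this design: a CO-RE tool ships no compiler at all and instead relies on the running kernel's self-describing BTF type information, which kernels built with &lt;code&gt;CONFIG_DEBUG_INFO_BTF&lt;/code&gt; (v5.4 and later) export here:&lt;/p&gt;

```shell
# libbpf reads this type info at load time to relocate field offsets
# for whatever kernel the tool happens to run on
ls -lh /sys/kernel/btf/vmlinux
```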

&lt;h3&gt;
  
  
  Performance comparison
&lt;/h3&gt;

&lt;p&gt;Performance optimization master &lt;a href="https://github.com/brendangregg"&gt;Brendan Gregg&lt;/a&gt; used libbpf + BPF CO-RE to convert a BCC tool and compared their performance data. &lt;a href="https://github.com/iovisor/bcc/pull/2778#issuecomment-594202408"&gt;He said&lt;/a&gt;: "As my colleague Jason pointed out, &lt;strong&gt;the memory footprint of opensnoop as CO-RE is much lower than opensnoop.py&lt;/strong&gt;. &lt;strong&gt;9 Mbytes for CO-RE vs 80 Mbytes for Python&lt;/strong&gt;."&lt;/p&gt;

&lt;p&gt;According to his research, libbpf + BPF CO-RE reduced runtime memory overhead nearly ninefold compared with BCC, which greatly benefits servers with scarce physical memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we're using libbpf-tools at PingCAP
&lt;/h2&gt;

&lt;p&gt;At PingCAP, we've been following BPF and its community development for a long time. In the past, every time we added a new machine, we had to install a set of BCC dependencies on it, which was troublesome. After &lt;a href="https://github.com/anakryiko"&gt;Andrii Nakryiko&lt;/a&gt; (the libbpf + BPF CO-RE project's leader) added the first libbpf-tools to the BCC project, we did our research and switched from bcc-tools to libbpf-tools. Fortunately, during the switch, we got guidance from him, Brendan, and &lt;a href="https://github.com/yonghong-song"&gt;Yonghong Song&lt;/a&gt; (the BTF project's leader). We've converted 18 BCC or bpftrace tools to &lt;a href="https://github.com/iovisor/bcc/tree/master/libbpf-tools"&gt;libbpf + BPF CO-RE&lt;/a&gt;, and we're using them in our company. &lt;/p&gt;

&lt;p&gt;For example, when we analyzed the I/O performance of a specific workload, we used multiple performance analysis tools at the block layer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Task&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Performance analysis tool&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Check I/O requests' latency distribution
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/biolatency.bpf.c"&gt;./biolatency -d nvme0n1&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Analyze I/O mode
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/biopattern.bpf.c"&gt;./biopattern -T 1 -d 259:0&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Check the request size distribution diagram when the task sent physical I/O requests
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/bitesize.bpf.c"&gt;./bitesize -c fio -T&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Analyze each physical I/O
   &lt;/td&gt;
   &lt;td&gt;
&lt;a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/biosnoop.bpf.c"&gt;./biosnoop -d nvme0n1&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The analysis results helped us optimize I/O performance. We're also exploring whether the scheduler-related libbpf-tools are helpful for tuning the &lt;a href="https://docs.pingcap.com/tidb/stable/"&gt;TiDB database&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;These tools are universal: feel free to give them a try. In the future, we'll implement more tools based on libbpf-tools. If you'd like to learn more about our experience with these tools, you can join the &lt;a href="https://slack.tidb.io/invite?team=tidb-community&amp;amp;channel=everyone&amp;amp;ref=pingcap-blog"&gt;TiDB community on Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This article was originally published at &lt;a href="https://en.pingcap.com/blog/why-we-switched-from-bcc-tools-to-libbpf-tools-for-bpf-performance-analysis"&gt;pingcap.com&lt;/a&gt; on December 3, 2020.&lt;/p&gt;

</description>
      <category>linux</category>
    </item>
  </channel>
</rss>
