<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jan Mühlig</title>
    <description>The latest articles on DEV Community by Jan Mühlig (@jmuehlig).</description>
    <link>https://dev.to/jmuehlig</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2285708%2F1fcff8d5-af5a-4c9b-ab32-3111fc9aff85.png</url>
      <title>DEV Community: Jan Mühlig</title>
      <link>https://dev.to/jmuehlig</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jmuehlig"/>
    <language>en</language>
    <item>
      <title>Profiling Specific Code Segments of Applications</title>
      <dc:creator>Jan Mühlig</dc:creator>
      <pubDate>Thu, 05 Dec 2024 09:38:47 +0000</pubDate>
      <link>https://dev.to/jmuehlig/profiling-specific-code-segments-of-applications-gdn</link>
      <guid>https://dev.to/jmuehlig/profiling-specific-code-segments-of-applications-gdn</guid>
      <description>&lt;p&gt;Understanding the interaction between software and hardware has become increasingly essential for building high-performance applications. &lt;br&gt;
The architecture of modern hardware systems has grown significantly in complexity, including deep memory hierarchies and advanced CPUs with features like out-of-order execution and sophisticated branch prediction mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://perfwiki.github.io/main/" rel="noopener noreferrer"&gt;Linux Perf&lt;/a&gt;, &lt;a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html" rel="noopener noreferrer"&gt;Intel VTune&lt;/a&gt;, and &lt;a href="https://www.amd.com/en/developer/uprof.html" rel="noopener noreferrer"&gt;AMD μProf&lt;/a&gt; are helpful tools for understanding how applications use system resources. &lt;br&gt;
However, as these tools are typically designed as external applications, they profile the entire program, making it difficult to focus on specific code segments like particular functions.&lt;br&gt;
This limitation is particularly challenging when analyzing micro-benchmarks, where the measured code may represent only a fraction of the overall runtime, or when distinguishing between different phases of an application's execution.&lt;/p&gt;
&lt;h2&gt;
  
  
  Counting Hardware Events
&lt;/h2&gt;

&lt;p&gt;At their core, these tools leverage &lt;em&gt;Performance Monitoring Units&lt;/em&gt; (PMUs), specialized hardware components that track events like &lt;em&gt;cache misses&lt;/em&gt; and &lt;em&gt;branch mispredictions&lt;/em&gt;.&lt;br&gt;
Although these tools offer far more functionality, this discussion will focus on the essentials of hardware event counting.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario: Random Access Pattern
&lt;/h3&gt;

&lt;p&gt;Consider a random access micro-benchmark designed to access a set of &lt;em&gt;cache lines&lt;/em&gt; in a random sequence—a scenario that typically baffles the data prefetcher (&lt;a href="https://github.com/jmuehlig/blog-resource/tree/main/01-profiling-specific-code-segments" rel="noopener noreferrer"&gt;see the full source code&lt;/a&gt;).&lt;br&gt;
The benchmark employs two distinct arrays: one holding the data and another containing indices that establish the random access pattern. &lt;br&gt;
After initializing these arrays, we execute the micro-benchmark by sequentially scanning through the indices array and accessing the data array at each index. Since the access order is random, each lookup into the contiguous data array should incur approximately &lt;strong&gt;one cache miss per access&lt;/strong&gt;.&lt;/p&gt;
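&lt;p&gt;As a minimal sketch, the setup could look like the following (the &lt;code&gt;cache_line&lt;/code&gt; struct matches the snippets later in this article; the use of &lt;code&gt;std::mt19937&lt;/code&gt; for shuffling is one illustrative way to randomize the index order):&lt;/p&gt;

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

/// One 8-byte value per 64-byte cache line.
struct alignas(64) cache_line { std::uint64_t value; };

/// Build the data and index arrays, then run the measured loop.
std::uint64_t run_random_access(const std::size_t size) {
    /// Data array: one cache line per item.
    auto data = std::vector<cache_line>(size);
    for (auto i = std::size_t{0}; i < size; ++i) { data[i].value = i; }

    /// Index array establishing the random access pattern.
    auto indices = std::vector<std::uint64_t>(size);
    std::iota(indices.begin(), indices.end(), 0U);
    std::shuffle(indices.begin(), indices.end(), std::mt19937{42U});

    /// Measured loop: sequential scan over indices, random access into data.
    auto sum = std::uint64_t{0};
    for (const auto index : indices) { sum += data[index].value; }
    return sum;
}
```

&lt;p&gt;Because every index appears exactly once, the loop touches each cache line of the data array exactly once, in a pseudo-random order that the prefetcher cannot anticipate.&lt;/p&gt;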
&lt;h3&gt;
  
  
  Perf Stat
&lt;/h3&gt;

&lt;p&gt;To observe the underlying hardware dynamics, we utilize the &lt;a href="https://perfwiki.github.io/main/tutorial/#counting-with-perf-stat" rel="noopener noreferrer"&gt;&lt;code&gt;perf stat&lt;/code&gt; command&lt;/a&gt;, which counts low-level hardware events such as &lt;em&gt;L1 data cache&lt;/em&gt; loads and misses during the execution of the micro-benchmark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;perf &lt;span class="nb"&gt;stat&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; instructions,cycles,L1-dcache-loads,L1-dcache-load-misses &lt;span class="nt"&gt;--&lt;/span&gt; ./random-access-bench &lt;span class="nt"&gt;--size&lt;/span&gt; 16777216
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running, &lt;code&gt;perf stat&lt;/code&gt; displays the results on the command line, in combination with metrics such as &lt;em&gt;instructions per cycle&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Performance counter stats &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="s1"&gt;'./random-access-bench --size 16777216'&lt;/span&gt;:

    3,697,089,032      instructions            &lt;span class="c"&gt;#    0.63  insn per cycle            &lt;/span&gt;
    5,879,736,227      cycles                                                                
    1,186,826,319      L1-dcache-loads                                                       
      103,262,784      L1-dcache-load-misses   &lt;span class="c"&gt;#    8.70% of all L1-dcache accesses &lt;/span&gt;

      1.202831289 seconds &lt;span class="nb"&gt;time &lt;/span&gt;elapsed

      0.799309000 seconds user
      0.403155000 seconds sys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zooming into details, the results reveal &lt;code&gt;103,262,784&lt;/code&gt; &lt;em&gt;L1d&lt;/em&gt; misses for &lt;code&gt;16,777,216&lt;/code&gt; items, which translates to &lt;code&gt;103,262,784 / 16,777,216&lt;/code&gt; ≈ &lt;code&gt;6&lt;/code&gt; misses per item. &lt;br&gt;
This number significantly surpasses the &lt;strong&gt;anticipated single cache miss&lt;/strong&gt; per item.&lt;br&gt;
The source of this discrepancy lies in the comprehensive scope of the &lt;code&gt;perf stat&lt;/code&gt; command, which records events throughout the entire runtime of the benchmark. &lt;br&gt;
This includes the initialization stage of the benchmark where both the data and pattern arrays are allocated and filled.&lt;br&gt;
Ideally, however, profiling should be confined to the specific segment of the code that interacts directly with the data array to achieve more accurate metrics.&lt;/p&gt;

&lt;p&gt;One effective strategy for more control over profiling is to start and stop hardware counters at specific code segments using file descriptors. &lt;br&gt;
This technique is well-documented in the &lt;a href="https://man7.org/linux/man-pages/man1/perf-stat.1.html" rel="noopener noreferrer"&gt;&lt;code&gt;perf stat&lt;/code&gt; man page&lt;/a&gt;. &lt;br&gt;
Pramod Kumbhar provides a practical guide to implementing this technique on &lt;a href="https://pramodkumbhar.com/2024/04/linux-perf-measuring-specific-code-sections-with-pause-resume-apis/" rel="noopener noreferrer"&gt;his blog&lt;/a&gt;, though some might find the approach somewhat cumbersome to implement.&lt;/p&gt;
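&lt;p&gt;A sketch of that technique, loosely following the man page example (the FIFO path is arbitrary, the &lt;code&gt;--control&lt;/code&gt; option requires a reasonably recent perf, and the benchmark binary itself would have to be extended to write &lt;code&gt;enable&lt;/code&gt; and &lt;code&gt;disable&lt;/code&gt; to the FIFO around the measured loop):&lt;/p&gt;

```shell
# Create a control FIFO and open it on a spare file descriptor.
ctl_fifo=/tmp/perf_ctl.fifo
test -p ${ctl_fifo} && unlink ${ctl_fifo}
mkfifo ${ctl_fifo}
exec {ctl_fd}<>${ctl_fifo}

# -D -1 starts with events disabled; perf then waits for 'enable' and
# 'disable' commands written to the control descriptor by the benchmark.
perf stat -D -1 -e L1-dcache-loads,L1-dcache-load-misses \
    --control fd:${ctl_fd} -- ./random-access-bench --size 16777216
```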
&lt;h2&gt;
  
  
  Controlling Performance Counters from C++ Applications
&lt;/h2&gt;

&lt;p&gt;Another strategy for achieving refined control over PMUs is to leverage the &lt;em&gt;perf subsystem&lt;/em&gt; directly from C and C++ applications through the &lt;a href="https://man7.org/linux/man-pages/man2/perf_event_open.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;perf_event_open&lt;/code&gt; system call&lt;/a&gt;. &lt;br&gt;
Given the complexity of this interface, various libraries have been developed to simplify interaction by embedding the &lt;code&gt;perf_event_open&lt;/code&gt; system call into their framework. &lt;br&gt;
Notable examples include &lt;a href="https://github.com/icl-utk-edu/papi" rel="noopener noreferrer"&gt;PAPI&lt;/a&gt;, &lt;a href="https://github.com/viktorleis/perfevent" rel="noopener noreferrer"&gt;PerfEvent&lt;/a&gt;, and &lt;a href="https://github.com/jmuehlig/perf-cpp" rel="noopener noreferrer"&gt;perf-cpp&lt;/a&gt;, each designed to offer a more accessible gateway to these advanced functionalities.&lt;/p&gt;
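&lt;p&gt;To illustrate what these libraries abstract away, here is a hedged sketch of the raw interface: a single counter, enabled and disabled around a code segment (error handling is minimal, and opening the counter may fail on systems with a restrictive &lt;code&gt;perf_event_paranoid&lt;/code&gt; setting):&lt;/p&gt;

```cpp
#include <cstdint>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/// Count retired instructions for a small code segment, or return -1
/// if the perf subsystem refuses to open the counter.
long long count_instructions() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;       /// Start disabled; enable around the segment.
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /// There is no glibc wrapper; invoke the system call directly.
    const auto fd = static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
    if (fd == -1) { return -1LL; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /// The code segment to measure.
    volatile std::uint64_t sum = 0U;
    for (auto i = 0U; i < 1000U; ++i) { sum += i; }

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) { count = -1; }
    close(fd);
    return count;
}
```

&lt;p&gt;Libraries such as &lt;em&gt;perf-cpp&lt;/em&gt; wrap exactly this open/enable/disable/read sequence, plus event configuration and result handling, behind a friendlier interface.&lt;/p&gt;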

&lt;p&gt;This article will specifically explore &lt;a href="https://github.com/jmuehlig/perf-cpp" rel="noopener noreferrer"&gt;perf-cpp&lt;/a&gt; and demonstrate practical examples of how to activate and deactivate hardware performance counters for targeted code segments. &lt;br&gt;
The &lt;code&gt;perf::EventCounter&lt;/code&gt; class in &lt;em&gt;perf-cpp&lt;/em&gt; allows users to define which events to measure and provides &lt;code&gt;start()&lt;/code&gt; and &lt;code&gt;stop()&lt;/code&gt; methods to manage the counters.&lt;br&gt;
Below is a code snippet that sets up the &lt;code&gt;EventCounter&lt;/code&gt; and focuses the measurement on the desired code segment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;perfcpp/event_counter.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="c1"&gt;/// Initialize the hardware event counter&lt;/span&gt;
&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;counters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;perf&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CounterDefinition&lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;
&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;event_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;perf&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;EventCounter&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;counters&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;/// Specify hardware events to count&lt;/span&gt;
&lt;span class="n"&gt;event_counter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;"instructions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"cycles"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"cache-references"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"cache-misses"&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;/// Setup benchmark here (this will not be measured)&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;alignas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64U&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;cache_line&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;cache_line&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;
&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;
&lt;span class="c1"&gt;/// Fill both vectors here...&lt;/span&gt;

&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0ULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;/// Run the workload and count hardware events&lt;/span&gt;
&lt;span class="n"&gt;event_counter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// &amp;lt;-- critical memory access&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;asm&lt;/span&gt; &lt;span class="nf"&gt;volatile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"r,m"&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"memory"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Ensure the compiler will not optimize sum away&lt;/span&gt;
&lt;span class="n"&gt;event_counter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the &lt;code&gt;EventCounter&lt;/code&gt; is initialized and the events of interest are added, we set up the benchmark by initializing the data and pattern arrays. &lt;br&gt;
Enclosing the workload we wish to measure with &lt;code&gt;start()&lt;/code&gt; and &lt;code&gt;stop()&lt;/code&gt; calls enables precise monitoring of that particular code segment. &lt;br&gt;
Upon stopping the counter, the &lt;code&gt;EventCounter&lt;/code&gt; can be queried to obtain the measured events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event_counter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;/// Print the performance counters.&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;cout&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s"&gt;" "&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s"&gt;" ("&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;16777216&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s"&gt;" per access)"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;endl&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output reflects only the activity during the measured loop, effectively excluding the initial setup phase where data is allocated and access patterns are established:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;102,284,667 instructions            &lt;span class="o"&gt;(&lt;/span&gt;6.09664 per access&lt;span class="o"&gt;)&lt;/span&gt;
992,091,716 cycles                  &lt;span class="o"&gt;(&lt;/span&gt;59.1333 per access&lt;span class="o"&gt;)&lt;/span&gt;
 34,227,532 L1-dcache-loads         &lt;span class="o"&gt;(&lt;/span&gt;2.04012 per access&lt;span class="o"&gt;)&lt;/span&gt;
 18,944,008 L1-dcache-load-misses   &lt;span class="o"&gt;(&lt;/span&gt;1.12915 per access&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results obtained are far easier to interpret than those from the &lt;code&gt;perf stat&lt;/code&gt; command.&lt;br&gt;
We observe two &lt;em&gt;L1d cache loads&lt;/em&gt; per access: one for the randomly accessed cache line and another for the index in the pattern array.&lt;br&gt;
Additionally, there are approximately &lt;code&gt;1.1&lt;/code&gt; &lt;em&gt;cache misses&lt;/em&gt; per access: one for each data cache line and &lt;code&gt;0.125&lt;/code&gt; for the access index, as eight 8-byte indices fit into a single cache line of the pattern array. &lt;/p&gt;
&lt;h2&gt;
  
  
  Hardware-specific Events
&lt;/h2&gt;

&lt;p&gt;While basic performance metrics such as &lt;em&gt;instructions&lt;/em&gt;, &lt;em&gt;cycles&lt;/em&gt;, and &lt;em&gt;cache misses&lt;/em&gt; shed light on the interplay of hardware and software, modern CPUs offer a far broader spectrum of events to monitor.&lt;br&gt;
However, it's important to note that many of these events are specific to the underlying microarchitecture.&lt;br&gt;
The &lt;em&gt;perf subsystem&lt;/em&gt; standardizes only a select group of events universally supported across different processors (&lt;a href="https://github.com/jmuehlig/perf-cpp/blob/dev/docs/counters.md#built-in-events" rel="noopener noreferrer"&gt;see a detailed list&lt;/a&gt;).&lt;br&gt;
To discover the full range of events available on specific CPUs, one can utilize the &lt;code&gt;perf list&lt;/code&gt; command. &lt;br&gt;
Additionally, Intel provides an extensive catalog of events for various architectures on their &lt;a href="https://perfmon-events.intel.com/" rel="noopener noreferrer"&gt;perfmon website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In order to use hardware-specific counters within applications, the readable event names need to be translated into event codes.&lt;br&gt;
To that end, &lt;a href="https://github.com/wcohen/libpfm4" rel="noopener noreferrer"&gt;Libpfm4&lt;/a&gt; provides a valuable tool that translates event names (from &lt;code&gt;perf list&lt;/code&gt;) into codes.&lt;/p&gt;

&lt;p&gt;Let us consider the event &lt;code&gt;CYCLES_NO_RETIRE.NOT_COMPLETE_MISSING_LOAD&lt;/code&gt; on the AMD Zen4 architecture as an example.&lt;br&gt;
The event quantifies the CPU cycles stalled due to pending memory requests, which is particularly insightful for assessing the effects of cache misses on modern systems. &lt;br&gt;
Intel offers analogous events, such as &lt;code&gt;CYCLE_ACTIVITY.STALLS_MEM_ANY&lt;/code&gt; on the Cascade Lake architecture, and both &lt;code&gt;EXE_ACTIVITY.BOUND_ON_LOADS&lt;/code&gt; and &lt;code&gt;EXE_ACTIVITY.BOUND_ON_STORES&lt;/code&gt; on the Sapphire Rapids architecture.&lt;/p&gt;

&lt;p&gt;After downloading and compiling &lt;em&gt;Libpfm4&lt;/em&gt;, developers can fetch the code for a specific event as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./examples/check_events CYCLES_NO_RETIRE.NOT_COMPLETE_MISSING_LOAD

Requested Event: CYCLES_NO_RETIRE.NOT_COMPLETE_MISSING_LOAD
Actual    Event: amd64_fam19h_zen4::CYCLES_NO_RETIRE:NOT_COMPLETE_MISSING_LOAD:k&lt;span class="o"&gt;=&lt;/span&gt;1:u&lt;span class="o"&gt;=&lt;/span&gt;1:e&lt;span class="o"&gt;=&lt;/span&gt;0:i&lt;span class="o"&gt;=&lt;/span&gt;0:c&lt;span class="o"&gt;=&lt;/span&gt;0:h&lt;span class="o"&gt;=&lt;/span&gt;0:g&lt;span class="o"&gt;=&lt;/span&gt;0
PMU            : AMD64 Fam19h Zen4
IDX            : 1077936192
Codes          : 0x53a2d6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Incorporating hardware-specific events into an application with &lt;em&gt;perf-cpp&lt;/em&gt; would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;perfcpp/event_counter.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="c1"&gt;/// Initialize the hardware event counter&lt;/span&gt;
&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;counters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;perf&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CounterDefinition&lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;
&lt;span class="n"&gt;counters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CYCLES_NO_RETIRE.NOT_COMPLETE_MISSING_LOAD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x53a2d6&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// &amp;lt;-- Event code from Libpfm4 output&lt;/span&gt;
&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;event_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;perf&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;EventCounter&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;counters&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;/// Specify hardware events to count&lt;/span&gt;
&lt;span class="n"&gt;event_counter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;"cycles"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"CYCLES_NO_RETIRE.NOT_COMPLETE_MISSING_LOAD"&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;/// Setup and execute the benchmark as demonstrated above...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This precise tracking reveals that approximately &lt;code&gt;57&lt;/code&gt; of &lt;code&gt;59&lt;/code&gt; CPU cycles per access are spent waiting for memory loads to complete: a finding consistent with the hardware being unable to predict the benchmark's random access pattern, leaving each access exposed to memory latency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;992,091,716 cycles                                      &lt;span class="o"&gt;(&lt;/span&gt;59.1333 per access&lt;span class="o"&gt;)&lt;/span&gt;
967,301,682 CYCLES_NO_RETIRE.NOT_COMPLETE_MISSING_LOAD  &lt;span class="o"&gt;(&lt;/span&gt;57.6557 per access&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, thanks to sophisticated out-of-order execution, the hardware masks much of the memory latency, which is around &lt;code&gt;700&lt;/code&gt; cycles on the machine used for this benchmark: at roughly &lt;code&gt;59&lt;/code&gt; cycles per access, on the order of &lt;code&gt;700 / 59&lt;/code&gt; ≈ &lt;code&gt;12&lt;/code&gt; misses must be in flight concurrently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Profiling tools play a crucial role in identifying bottlenecks and aiding developers in optimizing their code. &lt;br&gt;
Yet, the coarse granularity of external profilers means that the code segments of interest, when tracked with &lt;code&gt;perf stat&lt;/code&gt;, can be obscured by events from the rest of the program. &lt;br&gt;
Libraries like &lt;a href="https://github.com/icl-utk-edu/papi" rel="noopener noreferrer"&gt;PAPI&lt;/a&gt;, &lt;a href="https://github.com/viktorleis/perfevent" rel="noopener noreferrer"&gt;PerfEvent&lt;/a&gt;, and &lt;a href="https://github.com/jmuehlig/perf-cpp" rel="noopener noreferrer"&gt;perf-cpp&lt;/a&gt; offer a solution by allowing direct control over hardware performance counters from within the application itself. &lt;br&gt;
By leveraging the &lt;em&gt;perf subsystem&lt;/em&gt; (more precisely the &lt;code&gt;perf_event_open&lt;/code&gt; system call), these tools enable precise measurements of only the code segments that are truly relevant.&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>performance</category>
      <category>linux</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
