<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maarten Breddels</title>
    <description>The latest articles on DEV Community by Maarten Breddels (@maartenbreddels).</description>
    <link>https://dev.to/maartenbreddels</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F296144%2Fc527ae79-770c-462e-9519-ebb849cc3f23.jpeg</url>
      <title>DEV Community: Maarten Breddels</title>
      <link>https://dev.to/maartenbreddels</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maartenbreddels"/>
    <language>en</language>
    <item>
      <title>Tracing and visualizing the Python GIL with perf and VizTracer</title>
      <dc:creator>Maarten Breddels</dc:creator>
      <pubDate>Thu, 14 Jan 2021 11:26:46 +0000</pubDate>
      <link>https://dev.to/maartenbreddels/tracing-and-visualizing-the-python-gil-with-perf-and-viztracer-2jpp</link>
      <guid>https://dev.to/maartenbreddels/tracing-and-visualizing-the-python-gil-with-perf-and-viztracer-2jpp</guid>
      <description>&lt;h1&gt;
  
  
  Why the GIL matters
&lt;/h1&gt;

&lt;p&gt;There are plenty of articles explaining why the Python GIL (The Global Interpreter Lock) exists&lt;sup id="fnref1"&gt;1&lt;/sup&gt;, and why it is there. The TLDR version is: the GIL prevents multithreaded pure Python code from using multiple CPU cores.&lt;/p&gt;

&lt;p&gt;However, in &lt;a href="https://vaex.io" rel="noopener noreferrer"&gt;Vaex&lt;/a&gt; we execute most of the CPU intensive parts in C (C++) code, where we release the GIL. This is a common practice in high-performance Python libraries, where Python acts merely as a high-level glue.&lt;/p&gt;

&lt;p&gt;However, the GIL needs to be released explicitly, and this is the responsibility of the programmer and might be forgotten, leading to suboptimal use of your machine.&lt;/p&gt;

&lt;p&gt;I recently had this issue in &lt;a href="https://github.com/vaexio/vaex/pull/1114" rel="noopener noreferrer"&gt;Vaex&lt;/a&gt; where I simply forgot to release the GIL and found a similar issue in &lt;a href="https://github.com/apache/arrow/pull/7756" rel="noopener noreferrer"&gt;Apache Arrow&lt;/a&gt;&lt;sup id="fnref2"&gt;2&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Also, when running on 64 cores, I sometimes see a performance in Vaex that I am not happy with. It might be using 4000% CPU, instead of 6400% CPU, which is something I am not happy with. Instead of blindly pulling some levers to inspect the effect, I want to understand what is happening, and if the GIL is the problem, why, and where is it holding Vaex down. &lt;/p&gt;

&lt;h1&gt;
  
  
  Why I write this
&lt;/h1&gt;

&lt;p&gt;I'm planning to write a series of articles explaining some tools and techniques available for profiling/tracing Python together with native extensions, and how these tools can be glued together, to analyze and visualize what Python is doing, and when the GIL it taken or dropped.&lt;/p&gt;

&lt;p&gt;I hope this leads to improvement of tracing, profiling, and other performance tooling in the Python ecosystem, and the performance of the whole Python ecosystem.&lt;/p&gt;

&lt;h1&gt;
  
  
  Prerequisites
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Linux
&lt;/h2&gt;

&lt;p&gt;Get access to a Linux machine, and make sure you have root privileges (sudo is fine), or ask your sysadmin to execute some of these commands for you. For the rest of the document, we only run as user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Perf
&lt;/h2&gt;

&lt;p&gt;Make sure you have perf installed, e.g. on Ubuntu:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sudo yum install perf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Kernel configuration
&lt;/h2&gt;

&lt;p&gt;To enable running it as a user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Enable users to run perf (use at own risk)
$ sudo sysctl kernel.perf_event_paranoid=-1

# Enable users to see schedule trace events:
$ sudo mount -o remount,mode=755 /sys/kernel/debug
$ sudo mount -o remount,mode=755 /sys/kernel/debug/tracing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Python packages
&lt;/h2&gt;

&lt;p&gt;We will make use of &lt;a href="https://github.com/gaogaotiantian/viztracer/" rel="noopener noreferrer"&gt;VizTracer&lt;/a&gt; and &lt;a href="https://github.com/maartenbreddels/per4m" rel="noopener noreferrer"&gt;per4m&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pip install "viztracer&amp;gt;=0.11.2" "per4m&amp;gt;=0.1,&amp;lt;0.2"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Track thread or process states via perf
&lt;/h1&gt;

&lt;p&gt;There is no way to get the GIL state in Python &lt;sup id="fnref3"&gt;3&lt;/sup&gt; since there is no API for this. We can track it from the kernel, and the right tool for this under Linux is &lt;strong&gt;perf&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Using the linux perf tool (aka perf_events), we can listen to the state changes for processes/threads (we only care about sleeping and running), and log them. Although perf may look scary, it is a powerful tool. If you want to know a bit more about perf, I recommend reading &lt;a href="https://jvns.ca/blog/2018/04/16/new-perf-zine/" rel="noopener noreferrer"&gt;Julia Evans' zine on perf&lt;/a&gt; or &lt;a href="http://www.brendangregg.com/perf.html" rel="noopener noreferrer"&gt;go through Brendan Gregg's website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To build our intuition, we will first run perf on a &lt;a href="https://github.com/maartenbreddels/per4m/blob/master/per4m/example0.py" rel="noopener noreferrer"&gt;very trivial program&lt;/a&gt;:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# file per4m/example0.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Thread&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sleep_a_bit&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sleep_a_bit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We listen to just a few events to keep the noise down (note the use of wildcards):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ perf record -e sched:sched_switch  -e sched:sched_process_fork \
        -e 'sched:sched_wak*' -- python -m per4m.example0
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0,032 MB perf.data (33 samples) ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And use the &lt;code&gt;perf script&lt;/code&gt; command to write human/parsable output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ perf script
        :3040108 3040108 [032] 5563910.979408:                sched:sched_waking: comm=perf pid=3040114 prio=120 target_cpu=031
        :3040108 3040108 [032] 5563910.979431:                sched:sched_wakeup: comm=perf pid=3040114 prio=120 target_cpu=031
          python 3040114 [031] 5563910.995616:                sched:sched_waking: comm=kworker/31:1 pid=2502104 prio=120 target_cpu=031
          python 3040114 [031] 5563910.995618:                sched:sched_wakeup: comm=kworker/31:1 pid=2502104 prio=120 target_cpu=031
          python 3040114 [031] 5563910.995621:                sched:sched_waking: comm=ksoftirqd/31 pid=198 prio=120 target_cpu=031
          python 3040114 [031] 5563910.995622:                sched:sched_wakeup: comm=ksoftirqd/31 pid=198 prio=120 target_cpu=031
          python 3040114 [031] 5563910.995624:                sched:sched_switch: prev_comm=python prev_pid=3040114 prev_prio=120 prev_state=R+ ==&amp;gt; next_comm=kworker/31:1 next_pid=2502104 next_prio=120
          python 3040114 [031] 5563911.003612:                sched:sched_waking: comm=kworker/32:1 pid=2467833 prio=120 target_cpu=032
          python 3040114 [031] 5563911.003614:                sched:sched_wakeup: comm=kworker/32:1 pid=2467833 prio=120 target_cpu=032
          python 3040114 [031] 5563911.083609:                sched:sched_waking: comm=ksoftirqd/31 pid=198 prio=120 target_cpu=031
          python 3040114 [031] 5563911.083612:                sched:sched_wakeup: comm=ksoftirqd/31 pid=198 prio=120 target_cpu=031
          python 3040114 [031] 5563911.083613:                sched:sched_switch: prev_comm=python prev_pid=3040114 prev_prio=120 prev_state=R ==&amp;gt; next_comm=ksoftirqd/31 next_pid=198 next_prio=120
          python 3040114 [031] 5563911.108984:                sched:sched_waking: comm=node pid=2446812 prio=120 target_cpu=045
          python 3040114 [031] 5563911.109059:                sched:sched_waking: comm=node pid=2446812 prio=120 target_cpu=045
          python 3040114 [031] 5563911.112250:          sched:sched_process_fork: comm=python pid=3040114 child_comm=python child_pid=3040116
          python 3040114 [031] 5563911.112260:            sched:sched_wakeup_new: comm=python pid=3040116 prio=120 target_cpu=037
          python 3040114 [031] 5563911.112262:            sched:sched_wakeup_new: comm=python pid=3040116 prio=120 target_cpu=037
          python 3040114 [031] 5563911.112273:                sched:sched_switch: prev_comm=python prev_pid=3040114 prev_prio=120 prev_state=S ==&amp;gt; next_comm=swapper/31 next_pid=0 next_prio=120
          python 3040116 [037] 5563911.112418:                sched:sched_waking: comm=python pid=3040114 prio=120 target_cpu=031
          python 3040116 [037] 5563911.112450:                sched:sched_waking: comm=python pid=3040114 prio=120 target_cpu=031
          python 3040116 [037] 5563911.112473: sched:sched_wake_idle_without_ipi: cpu=31
         swapper     0 [031] 5563911.112476:                sched:sched_wakeup: comm=python pid=3040114 prio=120 target_cpu=031
          python 3040114 [031] 5563911.112485:                sched:sched_switch: prev_comm=python prev_pid=3040114 prev_prio=120 prev_state=S ==&amp;gt; next_comm=swapper/31 next_pid=0 next_prio=120
          python 3040116 [037] 5563911.112485:                sched:sched_waking: comm=python pid=3040114 prio=120 target_cpu=031
          python 3040116 [037] 5563911.112489:                sched:sched_waking: comm=python pid=3040114 prio=120 target_cpu=031
          python 3040116 [037] 5563911.112496:                sched:sched_switch: prev_comm=python prev_pid=3040116 prev_prio=120 prev_state=S ==&amp;gt; next_comm=swapper/37 next_pid=0 next_prio=120
         swapper     0 [031] 5563911.112497:                sched:sched_wakeup: comm=python pid=3040114 prio=120 target_cpu=031
          python 3040114 [031] 5563911.112513:                sched:sched_switch: prev_comm=python prev_pid=3040114 prev_prio=120 prev_state=S ==&amp;gt; next_comm=swapper/31 next_pid=0 next_prio=120
         swapper     0 [037] 5563912.113490:                sched:sched_waking: comm=python pid=3040116 prio=120 target_cpu=037
         swapper     0 [037] 5563912.113529:                sched:sched_wakeup: comm=python pid=3040116 prio=120 target_cpu=037
          python 3040116 [037] 5563912.113595:                sched:sched_waking: comm=python pid=3040114 prio=120 target_cpu=031
          python 3040116 [037] 5563912.113620:                sched:sched_waking: comm=python pid=3040114 prio=120 target_cpu=031
         swapper     0 [031] 5563912.113697:                sched:sched_wakeup: comm=python pid=3040114 prio=120 target_cpu=031
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Take a moment to digest the output. I can see a few things. Looking at the 4th column (time in seconds), we can see where the program slept (it skips 1 second). Here we see that we enter the sleeping state with a line like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python 3040114 [031] 5563911.112513:                sched:sched_switch: prev_comm=python prev_pid=3040114 prev_prio=120 prev_state=S ==&amp;gt; next_comm=swapper/31 next_pid=0 next_prio=120&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This means the kernel changed the state of the Python thread to &lt;code&gt;S&lt;/code&gt; (=sleeping) state.&lt;/p&gt;

&lt;p&gt;A full second later, we see it being woken up:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;swapper     0 [031] 5563912.113697:                sched:sched_wakeup: comm=python pid=3040114 prio=120 target_cpu=031&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Of course, you need to build some tooling around this, to really see what is happening. But one can imagine this output can be easily parsed and this is what &lt;a href="https://github.com/maartenbreddels/per4m/" rel="noopener noreferrer"&gt;per4m&lt;/a&gt; does. However, before we go there, I'd first like to visualize the flow of a slightly more advanced program using &lt;a href="https://github.com/gaogaotiantian/viztracer/" rel="noopener noreferrer"&gt;VizTracer&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  VizTracer
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/gaogaotiantian/viztracer/" rel="noopener noreferrer"&gt;VizTracer&lt;/a&gt; is a Python tracer that can visualize what your program does in the browser. Let us run it on a slightly &lt;a href="https://github.com/maartenbreddels/per4m/blob/master/per4m/example1.py" rel="noopener noreferrer"&gt;more advanced example&lt;/a&gt; to see what it looks like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# file per4m/example1.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;some_computation&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;   
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;thread1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;some_computation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;thread2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;some_computation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;thread1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;thread2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;thread1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thread2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running viztracer gives output like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ viztracer -o example1.html --ignore_frozen -m per4m.example1
Loading finish                                        
Saving report to /home/maartenbreddels/github/maartenbreddels/per4m/example1.html ...
Dumping trace data to json, total entries: 94, estimated json file size: 11.0KiB
Generating HTML report
Report saved.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the HTML should render as:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqhifda596tam0f4kgs86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqhifda596tam0f4kgs86.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this, it seems that &lt;code&gt;some_computation&lt;/code&gt; seem to be executed in parallel (twice), while in fact, we know the GIL is preventing that. So what is really going on?&lt;/p&gt;
&lt;h1&gt;
  
  
  Mixing VizTracer and perf output
&lt;/h1&gt;

&lt;p&gt;Let us run &lt;code&gt;perf&lt;/code&gt; on this, similarly to what we did to example0.py. However, we add the argument &lt;code&gt;-k CLOCK_MONOTONIC&lt;/code&gt; so that we use &lt;a href="https://github.com/gaogaotiantian/viztracer/blob/3321ba4024afe5623f938a601d7f7db3b08f534d/src/viztracer/modules/snaptrace.c#L91" rel="noopener noreferrer"&gt;the same clock as VizTracer&lt;/a&gt; and ask VizTracer to generate a JSON, instead of an HTML file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ perf record -e sched:sched_switch  -e sched:sched_process_fork -e 'sched:sched_wak*' \
   -k CLOCK_MONOTONIC  -- viztracer -o viztracer1.json --ignore_frozen -m per4m.example1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we use &lt;code&gt;per4m&lt;/code&gt; to translate the perf script results into a JSON that VizTracer can read&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ perf script | per4m perf2trace sched -o perf1.json
Wrote to perf1.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, use VizTracer to combine the two JSON files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ viztracer --combine perf1.json viztracer1.json -o example1_state.html
Saving report to /home/maartenbreddels/github/maartenbreddels/per4m/example1.html ...
Dumping trace data to json, total entries: 131, estimated json file size: 15.4KiB
Generating HTML report
Report saved.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhbextjzyiv6cxsqcewmz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhbextjzyiv6cxsqcewmz.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this visualization, it is clear the threads regularly enter a sleeping state due to the GIL and do not execute in parallel.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: The length of the sleep phase is ~5ms, which corresponds to the default value of &lt;a href="https://docs.python.org/3/library/sys.html#sys.getswitchinterval" rel="noopener noreferrer"&gt;sys.getswitchinterval&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;
  
  
  Detecting the GIL
&lt;/h1&gt;

&lt;p&gt;We see our process sleeping, but we do not see any difference between the sleeping state being triggered by calling &lt;code&gt;time.sleep&lt;/code&gt; and due to the GIL. There are multiple way to see the difference, and we will present two methods.&lt;/p&gt;
&lt;h2&gt;
  
  
  Via stack traces
&lt;/h2&gt;

&lt;p&gt;Using &lt;code&gt;perf record -g&lt;/code&gt; (or better &lt;code&gt;perf record --call-graph dwarf&lt;/code&gt; which implies &lt;code&gt;-g&lt;/code&gt;), we get stack traces for each event of perf.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ perf record -e sched:sched_switch  -e sched:sched_process_fork -e 'sched:sched_wak*'\
   -k CLOCK_MONOTONIC  --call-graph dwarf -- viztracer -o viztracer1-gil.json --ignore_frozen -m per4m.example1
Loading finish                                        
Saving report to /home/maartenbreddels/github/maartenbreddels/per4m/viztracer1-gil.json ...
Dumping trace data to json, total entries: 94, estimated json file size: 11.0KiB
Report saved.
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0,991 MB perf.data (164 samples) ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at the output of &lt;code&gt;perf script&lt;/code&gt; (where we add &lt;code&gt;--no-inline&lt;/code&gt; for performance reasons), we get a load of information. Looking at a state change event, we can now see that &lt;a href="https://github.com/python/cpython/blob/1a9f51ed12feb4d95ad6d0faf610a030c05b9f5e/Python/ceval_gil.h#L215" rel="noopener noreferrer"&gt;take_gil&lt;/a&gt; was called!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ perf script --no-inline | less
...
viztracer 3306851 [059] 5614683.022539:                sched:sched_switch: prev_comm=viztracer prev_pid=3306851 prev_prio=120 prev_state=S ==&amp;gt; next_comm=swapper/59 next_pid=0 next_prio=120
        ffffffff96ed4785 __sched_text_start+0x375 ([kernel.kallsyms])
        ffffffff96ed4785 __sched_text_start+0x375 ([kernel.kallsyms])
        ffffffff96ed4b92 schedule+0x42 ([kernel.kallsyms])
        ffffffff9654a51b futex_wait_queue_me+0xbb ([kernel.kallsyms])
        ffffffff9654ac85 futex_wait+0x105 ([kernel.kallsyms])
        ffffffff9654daff do_futex+0x10f ([kernel.kallsyms])
        ffffffff9654dfef __x64_sys_futex+0x13f ([kernel.kallsyms])
        ffffffff964044c7 do_syscall_64+0x57 ([kernel.kallsyms])
        ffffffff9700008c entry_SYSCALL_64_after_hwframe+0x44 ([kernel.kallsyms])
            7f4884b977b1 pthread_cond_timedwait@@GLIBC_2.3.2+0x271 (/usr/lib/x86_64-linux-gnu/libpthread-2.31.so)
            55595c07fe6d take_gil+0x1ad (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfaa0b3 PyEval_RestoreThread+0x23 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c000872 lock_PyThread_acquire_lock+0x1d2 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfe71f3 _PyMethodDef_RawFastCallKeywords+0x263 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfe7313 _PyCFunction_FastCallKeywords+0x23 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c01d657 call_function+0x3b7 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c060d00 _PyEval_EvalFrameDefault+0x610 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfb6db1 _PyEval_EvalCodeWithName+0x251 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfd6b00 _PyFunction_FastCallKeywords+0x520 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c01d334 call_function+0x94 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c060d00 _PyEval_EvalFrameDefault+0x610 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfb6db1 _PyEval_EvalCodeWithName+0x251 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfd6b00 _PyFunction_FastCallKeywords+0x520 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c01d334 call_function+0x94 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c060d00 _PyEval_EvalFrameDefault+0x610 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfd6766 _PyFunction_FastCallKeywords+0x186 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c01d334 call_function+0x94 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c060d00 _PyEval_EvalFrameDefault+0x610 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfd6766 _PyFunction_FastCallKeywords+0x186 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c060ae4 _PyEval_EvalFrameDefault+0x3f4 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfb6db1 _PyEval_EvalCodeWithName+0x251 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c074e5d builtin_exec+0x33d (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfe7078 _PyMethodDef_RawFastCallKeywords+0xe8 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfe7313 _PyCFunction_FastCallKeywords+0x23 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c066c39 _PyEval_EvalFrameDefault+0x6549 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfb77e0 _PyEval_EvalCodeWithName+0xc80 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfd6b62 _PyFunction_FastCallKeywords+0x582 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c01d334 call_function+0x94 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c060d00 _PyEval_EvalFrameDefault+0x610 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfd6766 _PyFunction_FastCallKeywords+0x186 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c01d334 call_function+0x94 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c060d00 _PyEval_EvalFrameDefault+0x610 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfd6766 _PyFunction_FastCallKeywords+0x186 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c01d334 call_function+0x94 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c060d00 _PyEval_EvalFrameDefault+0x610 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfd6766 _PyFunction_FastCallKeywords+0x186 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c060ae4 _PyEval_EvalFrameDefault+0x3f4 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfb6db1 _PyEval_EvalCodeWithName+0x251 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595bfb81e2 PyEval_EvalCode+0x22 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c0c51d1 run_mod+0x31 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c0cf31d PyRun_FileExFlags+0x9d (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c0cf50a PyRun_SimpleFileExFlags+0x1ba (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c0d05f0 pymain_main+0x3e0 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            55595c0d067b _Py_UnixMain+0x3b (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
            7f48849bc0b2 __libc_start_main+0xf2 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)
            55595c075100 _start+0x28 (/home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: We also see that &lt;code&gt;pthread_cond_timedwait&lt;/code&gt; is called, this is what &lt;a href="https://github.com/sumerc/gilstats.py" rel="noopener noreferrer"&gt;https://github.com/sumerc/gilstats.py&lt;/a&gt; uses for a eBPF tool, in case you are interested in other mutexes.&lt;/p&gt;

&lt;p&gt;Note: Also note that we do not see Python stacktrace, but &lt;code&gt;_PyEval_EvalFrameDefault&lt;/code&gt; etcetera instead. I plan to write how to inject Python stacktraces in a future article.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;code&gt;per4m perf2trace&lt;/code&gt; convert tool understands this and will generate different output when &lt;code&gt;take_gil&lt;/code&gt; is in the stacktrace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ perf script --no-inline | per4m perf2trace sched -o perf1-gil.json
Wrote to perf1-gil.json
$ viztracer --combine perf1-gil.json viztracer1-gil.json -o example1-gil.html
Saving report to /home/maartenbreddels/github/maartenbreddels/per4m/example1.html ...
Dumping trace data to json, total entries: 131, estimated json file size: 15.4KiB
Generating HTML report
Report saved.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flkf7huk66oh7ov5ka4x6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flkf7huk66oh7ov5ka4x6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we really see where the GIL plays a role!&lt;/p&gt;
&lt;h2&gt;
  
  
  Via probes (kprobes/uprobes)
&lt;/h2&gt;

&lt;p&gt;We now know when processes are sleeping (due to the GIL or other reasons), but if we want to get a more detailed look at where the GIL is taken or dropped, we need to know where &lt;code&gt;take_gil&lt;/code&gt; and &lt;code&gt;drop_gil&lt;/code&gt; are being called as well as returned. This can be traced using uprobes via perf. Uprobes are probes in userland, the equivalent to kprobes, which as you may have guessed operate in kernel space. &lt;a href="https://jvns.ca/blog/2017/07/09/linux-tracing-zine/" rel="noopener noreferrer"&gt;Julia Evans&lt;/a&gt; is again a great resource.&lt;/p&gt;

&lt;p&gt;Let us install the 4 uprobes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo perf probe -f -x `which python` python:take_gil=take_gil
sudo perf probe -f -x `which python` python:take_gil=take_gil%return
sudo perf probe -f -x `which python` python:drop_gil=drop_gil
sudo perf probe -f -x `which python` python:drop_gil=drop_gil%return

Added new events:
  python:take_gil      (on take_gil in /home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
  python:take_gil_1    (on take_gil in /home/maartenbreddels/miniconda/envs/dev/bin/python3.7)

You can now use it in all perf tools, such as:

        perf record -e python:take_gil_1 -aR sleep 1

Failed to find "take_gil%return",
 because take_gil is an inlined function and has no return point.
Added new event:
  python:take_gil__return (on take_gil%return in /home/maartenbreddels/miniconda/envs/dev/bin/python3.7)

You can now use it in all perf tools, such as:

        perf record -e python:take_gil__return -aR sleep 1

Added new events:
  python:drop_gil      (on drop_gil in /home/maartenbreddels/miniconda/envs/dev/bin/python3.7)
  python:drop_gil_1    (on drop_gil in /home/maartenbreddels/miniconda/envs/dev/bin/python3.7)

You can now use it in all perf tools, such as:

        perf record -e python:drop_gil_1 -aR sleep 1

Failed to find "drop_gil%return",
 because drop_gil is an inlined function and has no return point.
Added new event:
  python:drop_gil__return (on drop_gil%return in /home/maartenbreddels/miniconda/envs/dev/bin/python3.7)

You can now use it in all perf tools, such as:

        perf record -e python:drop_gil__return -aR sleep 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It complains a bit and seems to add multiple probes/events because &lt;code&gt;drop_gil&lt;/code&gt; and &lt;code&gt;take_gil&lt;/code&gt; are inlined (which means the function is present multiple times in the binary), but it seems to work 🤷 (let me know in the comments if it does not work for you).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: I may have been lucky, that the Python binary that I use (from conda-forge) is compiled in such a way that the relevant take_gil/drop_gil (and its return) that succeed are the relevant ones for this problem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note that the probes cause no performance hit, only when they are 'active' (like when we monitor them under perf) will the trigger a different code path. When monitored, the affected pages for the monitored process will be copied, and &lt;a href="https://lwn.net/Articles/499190/" rel="noopener noreferrer"&gt;breakpoints are inserted at the right locations&lt;/a&gt; (INT3 for x86 CPUs). The breakpoint will trigger the event for perf, which causes a small overhead. In case you want to remove the probes, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sudo perf probe --del 'python*'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;perf&lt;/code&gt; understand new events that it can listen to, so let us run perf again with &lt;code&gt;-e 'python:*gil*'&lt;/code&gt; as extra argument&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ perf record -e sched:sched_switch  -e sched:sched_process_fork -e 'sched:sched_wak*' -k CLOCK_MONOTONIC  \
  -e 'python:*gil*' -- viztracer  -o viztracer1-uprobes.json --ignore_frozen -m per4m.example1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: we removed &lt;code&gt;--call-graph dwarf&lt;/code&gt; otherwise perf cannot keep up, and we will lose events.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We then use &lt;code&gt;per4m perf2trace&lt;/code&gt; to convert this to a JSON that VizTracer understands, while we also get some free statistics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ perf script --no-inline | per4m perf2trace gil -o perf1-uprobes.json
...
Summary of threads:

PID         total(us)    no gil%✅    has gil%❗    gil wait%❌
--------  -----------  -----------  ------------  -------------
3353567*     164490.0         65.9          27.3            6.7
3353569       66560.0          0.3          48.2           51.5
3353570       60900.0          0.0          56.4           43.6

High 'no gil' is good (✅), we like low 'has gil' (❗),
 and we don't want 'gil wait' (❌). (* indicates main thread)
... 
Wrote to perf1-uprobes.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the &lt;code&gt;per4m perf2trace gil&lt;/code&gt; subcommand also gives a &lt;a href="https://github.com/chrisjbillington/gil_load" rel="noopener noreferrer"&gt;gil_load&lt;/a&gt; like output. From this output, we see that both threads are waiting for the GIL approximately 50% of the time, as expected.&lt;/p&gt;

&lt;p&gt;Using the same &lt;code&gt;perf.data&lt;/code&gt; file, that &lt;code&gt;perf record&lt;/code&gt; generated, we can also generate the thread/process state information. However, because we ran without the stacktraces, we do not know if we are sleeping due to the GIL or not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ perf script --no-inline | per4m perf2trace sched -o perf1-state.json
Wrote to perf1-state.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At last, we combine the three outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ viztracer --combine perf1-state.json perf1-uprobes.json viztracer1-uprobes.json -o example1-uprobes.html 
Saving report to /home/maartenbreddels/github/maartenbreddels/per4m/example1-uprobes.html ...
Dumping trace data to json, total entries: 10484, estimated json file size: 1.2MiB
Generating HTML report
Report saved.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our VizTracer output gives us a good overview of who has, and wants the GIL:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsgn0o2olzggb2k1py3ma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsgn0o2olzggb2k1py3ma.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above each thread, we see when a thread/process wants to take the gil, and when it succeeded (marked by &lt;code&gt;LOCK&lt;/code&gt;). Note that these periods overlap with the periods when the thread/process is &lt;em&gt;not&lt;/em&gt; sleeping (so running!). Also note that we see only 1 thread/process in the running state, as expected due to the GIL.&lt;/p&gt;

&lt;p&gt;The time between each call to &lt;code&gt;take_gil&lt;/code&gt;, and actually obtaining the lock (and thus leaving the sleeping state, or waking up), is exactly the time in the above table in the column &lt;code&gt;gil wait%❌&lt;/code&gt;. The time each thread has the GIL, or the time spanned by &lt;code&gt;LOCK&lt;/code&gt;, is the time in the column &lt;code&gt;has gil%❗&lt;/code&gt;.&lt;/p&gt;
&lt;h1&gt;
  
  
  Releasing the Kraken... ehm GIL
&lt;/h1&gt;

&lt;p&gt;We saw a pure Python program, running multithreaded, where the GIL is limiting performance by letting only 1 thread/process run at a time &lt;sup id="fnref1"&gt;1&lt;/sup&gt;. Let us now see what happens when the code releases the GIL, such as what happens when we execute NumPy functions.&lt;/p&gt;

&lt;p&gt;The second example execute &lt;code&gt;some_numpy_computation&lt;/code&gt;, which calls a NumPy function &lt;code&gt;M=4&lt;/code&gt; times, in parallel using 2 threads, while the main thread executes pure Python code.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# file per4m/example3.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;f8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;some_numpy_computation&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;



&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;thread1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;some_numpy_computation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;thread2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;some_numpy_computation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;thread1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;thread2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2_000_000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;thread1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thread2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Instead of running this script using perf and VizTracer, we now use the per4m &lt;code&gt;giltracer&lt;/code&gt; util, which automates all the steps done above (in a slightly smarter way&lt;sup id="fnref1"&gt;1&lt;/sup&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ giltracer --state-detect -o example2-uprobes.html -m per4m.example2
...

Summary of threads:

PID         total(us)    no gil%✅    has gil%❗    gil wait%❌
--------  -----------  -----------  ------------  -------------
3373601*    1359990.0         95.8           4.2            0.1
3373683       60276.4         84.6           2.2           13.2
3373684       57324.0         89.2           1.9            8.9

High 'no gil' is good (✅), we like low 'has gil' (❗),
 and we don't want 'gil wait' (❌). (* indicates main thread)
...
Saving report to /home/maartenbreddels/github/maartenbreddels/per4m/example2-uprobes.html ...
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9h4kh95kh2s2p4gae0oj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9h4kh95kh2s2p4gae0oj.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see that while the main thread is running Python code (it has the GIL, indicated by the &lt;code&gt;LOCK&lt;/code&gt; above it), the threads are also running. Note that in the pure Python example we only had one thread/process running at a time. While here we see moments where 3 threads are truly running parallel). This is possible because the NumPy routines that enter into C/C++/Fortran, released the GIL.&lt;/p&gt;

&lt;p&gt;However, the threads are still hampered by the GIL, since once the NumPy function returns to Python land, it needs to obtain the GIL again as can be seen by the &lt;code&gt;take_gil&lt;/code&gt; blocks taking a bit of time. This causes a 10% wait time for each thread. &lt;/p&gt;

&lt;h1&gt;
  
  
  Jupyter integration
&lt;/h1&gt;

&lt;p&gt;Since my workflow often involves working from a MacBook laptop&lt;sup id="fnref1"&gt;1&lt;/sup&gt; remotely connected to a Linux computer, I use the Jupyter notebook often to execute code remotely. Being a Jupyter developer, creating a cell magic to wrap this was mandatory.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#
# this registers the giltracer cell magic
&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;load_ext&lt;/span&gt; &lt;span class="n"&gt;per4m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;giltracer&lt;/span&gt;
&lt;span class="c1"&gt;# this call the main define above, but this can also be a multiline code cell
&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Saving report to /tmp/tmpvi8zw9ut/viztracer.json ...
Dumping trace data to json, total entries: 117, estimated json file size: 13.7KiB
Report saved.

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0,094 MB /tmp/tmpvi8zw9ut/perf.data (244 samples) ]

Wait for perf to finish...
perf script -i /tmp/tmpvi8zw9ut/perf.data --no-inline --ns | per4m perf2trace gil -o /tmp/tmpvi8zw9ut/giltracer.json -q -v 
Saving report to /home/maartenbreddels/github/maartenbreddels/fastblog/_notebooks/giltracer.html ...
Dumping trace data to json, total entries: 849, estimated json file size: 99.5KiB
Generating HTML report
Report saved.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Download giltracer.html&lt;/p&gt;

&lt;p&gt;Open giltracer.html in new tab (might not work due to security issue)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Using perf, we can detect process/thread states which gives us an idea which thread/process has the GIL in Python. Using stacktraces, we can find out if the sleeping states are really due to the GIL, and not due to &lt;code&gt;time.sleep&lt;/code&gt; for instance.&lt;/p&gt;

&lt;p&gt;Combining uprobes with perf, we can trace the calling and returning of the &lt;code&gt;take_gil&lt;/code&gt; and &lt;code&gt;drop_gil&lt;/code&gt; functions, getting an even better view on the effect of the GIL on your Python program.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;per4m&lt;/code&gt; Python package, is an experimental package to do some of the &lt;code&gt;perf script&lt;/code&gt; to VizTracer JSON format, and some orchestration tools to make this easier to work with.&lt;/p&gt;

&lt;h1&gt;
  
  
  TLDR
&lt;/h1&gt;

&lt;p&gt;If you just want to see where the GIL matters:&lt;/p&gt;

&lt;p&gt;Run this once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo yum install perf
sudo sysctl kernel.perf_event_paranoid=-1
sudo mount -o remount,mode=755 /sys/kernel/debug
sudo mount -o remount,mode=755 /sys/kernel/debug/tracing
sudo perf probe -f -x `which python` python:take_gil=take_gil
sudo perf probe -f -x `which python` python:take_gil=take_gil%return
sudo perf probe -f -x `which python` python:drop_gil=drop_gil
sudo perf probe -f -x `which python` python:drop_gil=drop_gil%return
pip install "viztracer&amp;gt;=0.11.2" "per4m&amp;gt;=0.1,&amp;lt;0.2"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# module
$ giltracer per4m/example2.py
# script
$ giltracer -m per4m.example2
# add thread/process state detection
$ giltracer --state-detect -m per4m.example2
# without uprobes (in case that fails)
$ giltracer --state-detect --no-gil-detect -m per4m.example2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Future plans
&lt;/h1&gt;

&lt;p&gt;I wish I did not have to develop these tools, and hope I inspire other to build better tooling, that will deprecate mine. I want to write high-performance code, and focus on that.&lt;/p&gt;

&lt;p&gt;However, I do see some options that I plan to write about in the future:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expose hardware performance counter in VizTracer, to see for instance cache misses, or CPU stalling.&lt;/li&gt;
&lt;li&gt;Inject Python stacktraces into perf stacktraces, so we can combine them with tools from &lt;a href="http://www.brendangregg.com/" rel="noopener noreferrer"&gt;http://www.brendangregg.com/&lt;/a&gt; .e.g. &lt;a href="http://www.brendangregg.com/offcpuanalysis.html" rel="noopener noreferrer"&gt;http://www.brendangregg.com/offcpuanalysis.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Repeat the same exercise using dtrace for usage under MacOS.&lt;/li&gt;
&lt;li&gt;Automatically detect which C function does &lt;em&gt;not&lt;/em&gt; release the GIL (basically autodetect &lt;a href="https://github.com/vaexio/vaex/pull/1114" rel="noopener noreferrer"&gt;https://github.com/vaexio/vaex/pull/1114&lt;/a&gt; &lt;a href="https://github.com/apache/arrow/pull/7756" rel="noopener noreferrer"&gt;https://github.com/apache/arrow/pull/7756&lt;/a&gt; )&lt;/li&gt;
&lt;li&gt;Apply these to more issues, like in &lt;a href="https://github.com/h5py/h5py/issues/1516" rel="noopener noreferrer"&gt;https://github.com/h5py/h5py/issues/1516&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have other ideas, want to pick some of this up, leave a message or contact me.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: &lt;a href="https://github.com/maartenbreddels/per4m" rel="noopener noreferrer"&gt;https://github.com/maartenbreddels/per4m&lt;/a&gt; is under a permissive MIT license, feel free to make use of that!&lt;/p&gt;
&lt;/blockquote&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;I am assuming CPython. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;Apache Arrow is a dependency of Vaex, so anytime the GIL is not released in Arrow, we (and others) suffer a performance hit. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Except using polling &lt;a href="https://github.com/chrisjbillington/gil_load/" rel="noopener noreferrer"&gt;https://github.com/chrisjbillington/gil_load/&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>perf</category>
      <category>python</category>
      <category>gil</category>
      <category>tracing</category>
    </item>
  </channel>
</rss>
