<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Edward Mascarenhas</title>
    <description>The latest articles on DEV Community by Edward Mascarenhas (@emascarenhas).</description>
    <link>https://dev.to/emascarenhas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1123700%2F8e0ae57e-dd34-49fe-9795-908c1e2cb8fe.png</url>
      <title>DEV Community: Edward Mascarenhas</title>
      <link>https://dev.to/emascarenhas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/emascarenhas"/>
    <language>en</language>
    <item>
      <title>Migrating reduction operations to SYCL in a molecular docking application</title>
      <dc:creator>Edward Mascarenhas</dc:creator>
      <pubDate>Thu, 10 Aug 2023 22:30:31 +0000</pubDate>
      <link>https://dev.to/oneapi/migrating-reduction-operations-to-sycl-in-a-molecular-docking-application-5fkp</link>
      <guid>https://dev.to/oneapi/migrating-reduction-operations-to-sycl-in-a-molecular-docking-application-5fkp</guid>
      <description>&lt;p&gt;I completed porting of a molecular docking application from CUDA to SYCL using the Intel® DPC++ Compatibility Tool (Compatibility Tool) in June 2021. Let me share selected techniques that I used without delving into the details of the docking application. If you want to learn how to use this tool to migrate CUDA applications to SYCL, please refer to [1].&lt;/p&gt;

&lt;p&gt;The Compatibility Tool adds comments in the code where manual migration may be required. Typically, the manual changes required fall into two categories. First, changes are required for the code to compile and make the code functionally correct. Other changes are necessary to get better performance. Here, I will cover code that uses the operation of 'reduction'. Reductions are frequently used in High Performance Computing and scientific applications and can be performance hotspots. The first example finds the sum of integers and the second finds the minimum of floats and the identifier of the run that corresponds to the minimum.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integer Reductions to find the number of evaluations
&lt;/h2&gt;

&lt;p&gt;The  docking application performs integer reductions to keep a running count of the number of score evaluations. This reduction is  implemented as a multi-line macro in CUDA as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="cp"&gt;#define REDUCEINTEGERSUM(value, pAccumulator)     
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     
    &lt;span class="p"&gt;{&lt;/span&gt;     
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pAccumulator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     
    &lt;span class="p"&gt;}&lt;/span&gt;     
    &lt;span class="n"&gt;__threadfence&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;     
    &lt;span class="n"&gt;__syncthreads&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;     
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__any_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xffffffff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;     
    &lt;span class="p"&gt;{&lt;/span&gt;     
        &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;tgx&lt;/span&gt;            &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;cData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warpmask&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     
        &lt;span class="n"&gt;value&lt;/span&gt;                  &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;__shfl_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xffffffff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tgx&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     
        &lt;span class="n"&gt;value&lt;/span&gt;                  &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;__shfl_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xffffffff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tgx&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     
        &lt;span class="n"&gt;value&lt;/span&gt;                  &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;__shfl_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xffffffff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tgx&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     
        &lt;span class="n"&gt;value&lt;/span&gt;                  &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;__shfl_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xffffffff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tgx&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     
        &lt;span class="n"&gt;value&lt;/span&gt;                  &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;__shfl_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xffffffff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tgx&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tgx&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     
        &lt;span class="p"&gt;{&lt;/span&gt;     
            &lt;span class="n"&gt;atomicAdd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pAccumulator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     
        &lt;span class="p"&gt;}&lt;/span&gt;     
    &lt;span class="p"&gt;}&lt;/span&gt;     
    &lt;span class="n"&gt;__threadfence&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;     
    &lt;span class="n"&gt;__syncthreads&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;     
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pAccumulator&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     
    &lt;span class="n"&gt;__syncthreads&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let us review what this code is doing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The code is called for each work item (thread) in a work group (warp)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;*pAccumulator&lt;/em&gt; is where the final sum is stored summing across all work items&lt;/li&gt;
&lt;li&gt;The combination of &lt;em&gt;__threadfence()&lt;/em&gt; and &lt;em&gt;__syncthreads()&lt;/em&gt; guarantees memory consistency and synchronizes threads in the warp at the point of the call.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;__any_sync()&lt;/em&gt; call executes the block for those non-exited threads for which &lt;em&gt;'value != 0'&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The following &lt;em&gt;__shfl_sync&lt;/em&gt; calls do a tree-wise summing with the final sum available in the first thread in the warp in variable &lt;em&gt;value&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;value&lt;/em&gt; is then added to the Accumulator atomically with &lt;em&gt;atomicAdd&lt;/em&gt; and finally all threads assign the sum to the &lt;em&gt;value&lt;/em&gt; variable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more details about these CUDA calls please refer to [2].&lt;/p&gt;

&lt;p&gt;The Compatibility tool was not able to automatically migrate this code with the following comments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/*
DPCT1023:40: The DPC++ sub-group does not support mask options for sycl::ext::oneapi::any_of.

DPCT1023:41: The DPC++ sub-group does not support mask options for shuffle.

DPCT1007:39: Migration of this CUDA API is not supported by the Intel(R) DPC++ Compatibility Tool.
*/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, SYCL supports a rich set of functions for performing reductions. In this case, the reduce_over_group() function in SYCL can be used to create the same functionality as the above code as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
#define REDUCEINTEGERSUM(value, pAccumulator)     
        int val = sycl::reduce_over_group(item_ct1.get_group(), value, std::plus&amp;lt;&amp;gt;());      
        *pAccumulator = val;     
        item_ct1.barrier(sycl::access::fence_space::local_space);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;em&gt;sycl::reduce_over_group&lt;/em&gt; is a collective function. The usage of this function simplifies the macro. The function takes the group, the value to be reduced, and the reduction operation which in this case is plus or summation. The function can adapt to varied sizes of work groups in SYCL and will use the best available optimizations available per the compiler and run-time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding the minimum energy
&lt;/h2&gt;

&lt;p&gt;In another part of the application, a block of CUDA threads perform shuffles to find the minimum of scores &lt;em&gt;v0&lt;/em&gt; and the corresponding identifier &lt;em&gt;k0&lt;/em&gt; of the run in the simulation that is the minimum score. The CUDA code calls a macro WARPMINIMUM2 (not shown) which in turn calls another macro WARPMINIMUMEXCHANGE (shown) with &lt;em&gt;mask&lt;/em&gt; set to 1, 2, 4, 8, and 16.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="cp"&gt;#define WARPMINIMUMEXCHANGE(tgx, v0, k0, mask)     
&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;     
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;k1&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;otgx&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tgx&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;v2&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__shfl_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xffffffff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;otgx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;k2&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__shfl_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xffffffff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;otgx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;v2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tgx&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;otgx&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;v2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     
        &lt;span class="n"&gt;k0&lt;/span&gt;          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;k1&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;k2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     
        &lt;span class="n"&gt;v0&lt;/span&gt;          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;em&gt;__shfl_sync&lt;/em&gt; provides a way of moving a value from one thread to other threads in the warp in one instruction. In this code snippet &lt;em&gt;__shfl_sync&lt;/em&gt; gets the &lt;em&gt;v0&lt;/em&gt; or &lt;em&gt;k0&lt;/em&gt; value from the thread identified by the &lt;em&gt;otgx&lt;/em&gt; mask and saves it in &lt;em&gt;v2&lt;/em&gt;, &lt;em&gt;k2&lt;/em&gt; variables. We then compare &lt;em&gt;v1&lt;/em&gt; with &lt;em&gt;v2&lt;/em&gt; to set &lt;em&gt;flag&lt;/em&gt; and eventually store the minimum in &lt;em&gt;v0&lt;/em&gt; and the run identifier for this minimum in &lt;em&gt;k0&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Compatibility Tool could not completely migrate this code and included this comment as the reason it could not. However, Compatibility Tool correctly replaced the &lt;em&gt;__shfl_sync&lt;/em&gt; call with SYCL &lt;em&gt;shuffle&lt;/em&gt; call as shown in the below diff which shows the manual change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/*
DPCT1023:57: The DPC++ sub-group does not support mask options for shuffle.
*/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This comment indicates that the &lt;em&gt;shuffle&lt;/em&gt; call in SYCL does not use a mask as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="err"&gt;#define&lt;/span&gt; WARPMINIMUMEXCHANGE(tgx, v0, k0, mask)     
        {     
                float v1 = v0;     
                int k1 = k0;     
                int otgx = tgx ^ mask;     
&lt;span class="gd"&gt;-               float v2 = item_ct1.get_sub_group().shuffle(energy, otgx);     
&lt;/span&gt;&lt;span class="gi"&gt;+               float v2 = item_ct1.get_sub_group().shuffle(v0, otgx);  
&lt;/span&gt;&lt;span class="gd"&gt;-               int k2 = item_ct1.get_sub_group().shuffle(bestID, otgx);     
&lt;/span&gt;&lt;span class="gi"&gt;+               int k2 = item_ct1.get_sub_group().shuffle(k0, otgx);     
&lt;/span&gt;                int flag = ((v1 &amp;lt; v2) ^ (tgx &amp;gt; otgx)) &amp;amp;&amp;amp; (v1 != v2);     
                k0 = flag ? k1 : k2;     
                v0 = flag ? v1 : v2;     
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, Compatibility Tool performed incorrect variable substitution for &lt;em&gt;v0&lt;/em&gt; and &lt;em&gt;k0&lt;/em&gt; in the shuffle calls using &lt;em&gt;energy&lt;/em&gt; and &lt;em&gt;bestID&lt;/em&gt; variables from the caller function. We manually fixed this by replacing &lt;em&gt;energy&lt;/em&gt; with &lt;em&gt;v0&lt;/em&gt; and &lt;em&gt;bestID&lt;/em&gt; with &lt;em&gt;k0&lt;/em&gt;. This bug has been fixed in recent versions of the Compatibility Tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In summary, reduction operations in CUDA applications may not be migrated correctly by the Compatibility Tool. Review the comments provided by the tool to understand if manual migration is necessary and what change might be required. A good understanding of the original CUDA code will then help to make manual changes to develop functionally correct code in SYCL.&lt;/p&gt;

&lt;p&gt;[1] &lt;a href="https://www.intel.com/content/www/us/en/docs/dpcpp-compatibility-tool/get-started-guide/2023-1/overview.html"&gt;https://www.intel.com/content/www/us/en/docs/dpcpp-compatibility-tool/get-started-guide/2023-1/overview.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] &lt;a href="https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/"&gt;https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sycl</category>
      <category>tips</category>
      <category>oneapi</category>
    </item>
  </channel>
</rss>
