hpc.social


High Performance Computing
Practitioners and friends
Community Syndicated Blog

This is a crosspost from the DEV Community: oneAPI Community blog. See the original post here.

Migrating reduction operations to SYCL in a molecular docking application

I completed porting a molecular docking application from CUDA to SYCL using the Intel® DPC++ Compatibility Tool (Compatibility Tool) in June 2021. Let me share selected techniques I used, without delving into the details of the docking application. If you want to learn how to use this tool to migrate CUDA applications to SYCL, please refer to [1].

The Compatibility Tool adds comments in the code where manual migration may be required. Typically, the required manual changes fall into two categories: changes needed for the code to compile and be functionally correct, and changes needed for better performance. Here, I will cover code that performs reductions. Reductions are frequently used in High Performance Computing and scientific applications and can be performance hotspots. The first example finds the sum of integers; the second finds the minimum of floats and the identifier of the run that corresponds to that minimum.

Integer Reductions to find the number of evaluations

The docking application performs integer reductions to keep a running count of the number of score evaluations. In CUDA, this reduction is implemented as the multi-line macro shown below.

#define REDUCEINTEGERSUM(value, pAccumulator) \
    if (threadIdx.x == 0) \
    { \
        *pAccumulator = 0; \
    } \
    __threadfence(); \
    __syncthreads(); \
    if (__any_sync(0xffffffff, value != 0)) \
    { \
        uint32_t tgx            = threadIdx.x & cData.warpmask; \
        value                  += __shfl_sync(0xffffffff, value, tgx ^ 1); \
        value                  += __shfl_sync(0xffffffff, value, tgx ^ 2); \
        value                  += __shfl_sync(0xffffffff, value, tgx ^ 4); \
        value                  += __shfl_sync(0xffffffff, value, tgx ^ 8); \
        value                  += __shfl_sync(0xffffffff, value, tgx ^ 16); \
        if (tgx == 0) \
        { \
            atomicAdd(pAccumulator, value); \
        } \
    } \
    __threadfence(); \
    __syncthreads(); \
    value = *pAccumulator; \
    __syncthreads();

Let us review what this code is doing:

- Thread 0 zeroes the accumulator, and a fence plus a barrier make the cleared value visible to all threads in the block.
- If any thread in the warp has a non-zero value, each thread adds in the value of a partner lane (tgx ^ mask) fetched with __shfl_sync, for masks 1, 2, 4, 8, and 16; after these five exchange steps, every lane in the warp holds the warp-wide sum.
- Lane 0 of the warp atomically adds the warp sum to the accumulator.
- After another fence and barrier, every thread reads the final block-wide sum back from the accumulator.

For more details about these CUDA calls please refer to [2].

The Compatibility Tool was not able to migrate this code automatically, and emitted the following comments.

/*
DPCT1023:40: The DPC++ sub-group does not support mask options for sycl::ext::oneapi::any_of.

DPCT1023:41: The DPC++ sub-group does not support mask options for shuffle.

DPCT1007:39: Migration of this CUDA API is not supported by the Intel(R) DPC++ Compatibility Tool.
*/

However, SYCL supports a rich set of functions for performing reductions. In this case, the reduce_over_group() function can be used to implement the same functionality as the code above, as follows.


#define REDUCEINTEGERSUM(value, pAccumulator) \
        int val = sycl::reduce_over_group(item_ct1.get_group(), value, std::plus<>()); \
        *pAccumulator = val; \
        item_ct1.barrier(sycl::access::fence_space::local_space);

sycl::reduce_over_group is a group collective function, and using it greatly simplifies the macro. The function takes the group, the value to be reduced, and the reduction operation, which in this case is plus (summation). It adapts to varying work-group sizes in SYCL and uses the best optimizations available in the compiler and runtime.
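For intuition, the semantics of this collective can be modeled on the host as follows (a plain C++ sketch, not the SYCL API): one value per work-item goes in, and every work-item in the group receives the combined result.

```cpp
#include <numeric>
#include <vector>

// Host model of what a '+' reduce_over_group computes (a simplified
// illustration, not the SYCL API): combine one contribution per work-item
// and hand the same combined value back to every work-item in the group.
std::vector<int> reduce_over_group_model(const std::vector<int>& values) {
    int sum = std::accumulate(values.begin(), values.end(), 0);
    return std::vector<int>(values.size(), sum);  // same result on every work-item
}
```

This is why the SYCL macro no longer needs the explicit zeroing, fences, and per-mask shuffles of the CUDA version: the collective expresses the whole reduction in one call.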

Finding the minimum energy

In another part of the application, a block of CUDA threads performs shuffles to find the minimum score v0 and the identifier k0 of the run that produced that minimum. The CUDA code calls a macro WARPMINIMUM2 (not shown), which in turn calls another macro WARPMINIMUMEXCHANGE (shown) with mask set to 1, 2, 4, 8, and 16.

#define WARPMINIMUMEXCHANGE(tgx, v0, k0, mask) \
    { \
        float v1    = v0; \
        int k1      = k0; \
        int otgx    = tgx ^ mask; \
        float v2    = __shfl_sync(0xffffffff, v0, otgx); \
        int k2      = __shfl_sync(0xffffffff, k0, otgx); \
        int flag    = ((v1 < v2) ^ (tgx > otgx)) && (v1 != v2); \
        k0          = flag ? k1 : k2; \
        v0          = flag ? v1 : v2; \
    }

__shfl_sync provides a way of moving a value from one thread to other threads in the warp in a single instruction. In this code snippet, __shfl_sync reads the v0 and k0 values from the partner thread otgx (the current lane index XORed with the mask) and saves them in the v2 and k2 variables. We then compare v1 with v2 to set flag, and eventually store the smaller score in v0 and its run identifier in k0.
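To see how the exchanges converge, here is a host-side C++ sketch (an illustration only, not the application's code) that applies the macro's logic across 32 lanes for all five masks. With this exchange rule, lane 0 ends up holding the minimum score together with its run identifier:

```cpp
#include <array>
#include <utility>

// Host sketch of the WARPMINIMUMEXCHANGE butterfly: each lane holds a score
// v and a run identifier k. The keep/take decision mirrors the CUDA macro:
// flag = ((v1 < v2) ^ (tgx > otgx)) && (v1 != v2). After the five exchange
// steps (masks 1, 2, 4, 8, 16), lane 0 holds the minimum score and its id.
std::pair<float, int> warp_minimum(std::array<float, 32> v,
                                   std::array<int, 32> k) {
    for (int mask : {1, 2, 4, 8, 16}) {
        auto v_prev = v;  // values before this exchange step
        auto k_prev = k;
        for (int tgx = 0; tgx < 32; ++tgx) {
            int otgx = tgx ^ mask;  // partner lane for this step
            float v1 = v_prev[tgx], v2 = v_prev[otgx];
            int k1 = k_prev[tgx], k2 = k_prev[otgx];
            // Keep own (v1, k1) if flag is set, else take the partner's.
            bool flag = ((v1 < v2) != (tgx > otgx)) && (v1 != v2);
            v[tgx] = flag ? v1 : v2;
            k[tgx] = flag ? k1 : k2;
        }
    }
    return {v[0], k[0]};  // lane 0 now holds the warp minimum
}
```

Note the asymmetry in flag: the lower-indexed lane of each pair keeps the smaller value, so after each step the minimum of a progressively larger group of lanes accumulates toward lane 0.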

The Compatibility Tool could not completely migrate this code and included the following comment as the reason. It did replace the __shfl_sync calls with SYCL shuffle calls, but with incorrect arguments, as shown in the diff below of the manual change.

/*
DPCT1023:57: The DPC++ sub-group does not support mask options for shuffle.
*/

This comment indicates that the shuffle call in SYCL does not take a mask, as shown below.

#define WARPMINIMUMEXCHANGE(tgx, v0, k0, mask) \
        { \
                float v1 = v0; \
                int k1 = k0; \
                int otgx = tgx ^ mask; \
-               float v2 = item_ct1.get_sub_group().shuffle(energy, otgx); \
+               float v2 = item_ct1.get_sub_group().shuffle(v0, otgx); \
-               int k2 = item_ct1.get_sub_group().shuffle(bestID, otgx); \
+               int k2 = item_ct1.get_sub_group().shuffle(k0, otgx); \
                int flag = ((v1 < v2) ^ (tgx > otgx)) && (v1 != v2); \
                k0 = flag ? k1 : k2; \
                v0 = flag ? v1 : v2; \
        }

In this case, the Compatibility Tool performed an incorrect variable substitution in the shuffle calls, using the energy and bestID variables from the caller instead of the macro parameters. We fixed this manually by replacing energy with v0 and bestID with k0. This bug has been fixed in recent versions of the Compatibility Tool.

Summary

In summary, reduction operations in CUDA applications may not be migrated correctly by the Compatibility Tool. Review the comments the tool emits to understand whether manual migration is necessary and what change might be required. A good understanding of the original CUDA code will then help you make the manual changes needed to produce functionally correct SYCL code.

[1] https://www.intel.com/content/www/us/en/docs/dpcpp-compatibility-tool/get-started-guide/2023-1/overview.html

[2] https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/