hpc.social


High Performance Computing
Practitioners and friends
Community Syndicated Blog

This is a crosspost from the DEV Community: oneAPI Community blog. See the original post here.

Migrating reduction operations to SYCL in a molecular docking application

I completed porting a molecular docking application from CUDA to SYCL using the Intel® DPC++ Compatibility Tool (Compatibility Tool) in June 2021. Let me share selected techniques I used, without delving into the details of the docking application. If you want to learn how to use this tool to migrate CUDA applications to SYCL, please refer to [1].

The Compatibility Tool adds comments in the code where manual migration may be required. Typically, the required manual changes fall into two categories: changes needed for the code to compile and be functionally correct, and changes needed for better performance. Here, I will cover code that performs reductions. Reductions are frequently used in High Performance Computing and scientific applications and can be performance hotspots. The first example finds the sum of integers; the second finds the minimum of floats and the identifier of the run that corresponds to that minimum.

Integer Reductions to find the number of evaluations

The docking application performs integer reductions to keep a running count of the number of score evaluations. In CUDA, this reduction is implemented as the multi-line macro shown below.

#define REDUCEINTEGERSUM(value, pAccumulator) \
    if (threadIdx.x == 0) \
    { \
        *pAccumulator = 0; \
    } \
    __threadfence(); \
    __syncthreads(); \
    if (__any_sync(0xffffffff, value != 0)) \
    { \
        uint32_t tgx            = threadIdx.x & cData.warpmask; \
        value                  += __shfl_sync(0xffffffff, value, tgx ^ 1); \
        value                  += __shfl_sync(0xffffffff, value, tgx ^ 2); \
        value                  += __shfl_sync(0xffffffff, value, tgx ^ 4); \
        value                  += __shfl_sync(0xffffffff, value, tgx ^ 8); \
        value                  += __shfl_sync(0xffffffff, value, tgx ^ 16); \
        if (tgx == 0) \
        { \
            atomicAdd(pAccumulator, value); \
        } \
    } \
    __threadfence(); \
    __syncthreads(); \
    value = *pAccumulator; \
    __syncthreads();

Let us review what this code is doing:

- Thread 0 zeroes the accumulator, and a fence plus a barrier make the cleared value visible to all threads in the block.
- If any thread in the warp has a non-zero value, each thread adds in the value of a partner lane (tgx ^ mask) fetched with __shfl_sync, for masks 1, 2, 4, 8, and 16; after these five exchange steps, every lane in the warp holds the warp-wide sum.
- Lane 0 of the warp atomically adds the warp sum to the accumulator.
- After another fence and barrier, every thread reads the final block-wide sum back from the accumulator.

For more details about these CUDA calls please refer to [2].

The Compatibility Tool was not able to migrate this code automatically, and emitted the following comments.

/*
DPCT1023:40: The DPC++ sub-group does not support mask options for sycl::ext::oneapi::any_of.

DPCT1023:41: The DPC++ sub-group does not support mask options for shuffle.

DPCT1007:39: Migration of this CUDA API is not supported by the Intel(R) DPC++ Compatibility Tool.
*/

However, SYCL supports a rich set of functions for performing reductions. In this case, the reduce_over_group() function can be used to implement the same functionality as the code above, as follows.


#define REDUCEINTEGERSUM(value, pAccumulator) \
        int val = sycl::reduce_over_group(item_ct1.get_group(), value, std::plus<>()); \
        *pAccumulator = val; \
        item_ct1.barrier(sycl::access::fence_space::local_space);

sycl::reduce_over_group is a group collective function, and using it greatly simplifies the macro. The function takes the group, the value to be reduced, and the reduction operation, which in this case is plus (summation). It adapts to varying work-group sizes in SYCL and uses the best optimizations available in the compiler and runtime.
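For intuition, the semantics of this collective can be modeled on the host as follows (a plain C++ sketch, not the SYCL API): one value per work-item goes in, and every work-item in the group receives the combined result.

```cpp
#include <numeric>
#include <vector>

// Host model of what a '+' reduce_over_group computes (a simplified
// illustration, not the SYCL API): combine one contribution per work-item
// and hand the same combined value back to every work-item in the group.
std::vector<int> reduce_over_group_model(const std::vector<int>& values) {
    int sum = std::accumulate(values.begin(), values.end(), 0);
    return std::vector<int>(values.size(), sum);  // same result on every work-item
}
```

This is why the SYCL macro no longer needs the explicit zeroing, fences, and per-mask shuffles of the CUDA version: the collective expresses the whole reduction in one call.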

Finding the minimum energy

In another part of the application, a block of CUDA threads performs shuffles to find the minimum score v0 and the identifier k0 of the run that produced that minimum. The CUDA code calls a macro WARPMINIMUM2 (not shown), which in turn calls another macro WARPMINIMUMEXCHANGE (shown) with mask set to 1, 2, 4, 8, and 16.

#define WARPMINIMUMEXCHANGE(tgx, v0, k0, mask) \
    { \
        float v1    = v0; \
        int k1      = k0; \
        int otgx    = tgx ^ mask; \
        float v2    = __shfl_sync(0xffffffff, v0, otgx); \
        int k2      = __shfl_sync(0xffffffff, k0, otgx); \
        int flag    = ((v1 < v2) ^ (tgx > otgx)) && (v1 != v2); \
        k0          = flag ? k1 : k2; \
        v0          = flag ? v1 : v2; \
    }

__shfl_sync provides a way of moving a value from one thread to other threads in the warp in a single instruction. In this code snippet, __shfl_sync reads the v0 and k0 values from the partner thread otgx (the current lane index XORed with the mask) and saves them in the v2 and k2 variables. We then compare v1 with v2 to set flag, and eventually store the smaller score in v0 and its run identifier in k0.
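To see how the exchanges converge, here is a host-side C++ sketch (an illustration only, not the application's code) that applies the macro's logic across 32 lanes for all five masks. With this exchange rule, lane 0 ends up holding the minimum score together with its run identifier:

```cpp
#include <array>
#include <utility>

// Host sketch of the WARPMINIMUMEXCHANGE butterfly: each lane holds a score
// v and a run identifier k. The keep/take decision mirrors the CUDA macro:
// flag = ((v1 < v2) ^ (tgx > otgx)) && (v1 != v2). After the five exchange
// steps (masks 1, 2, 4, 8, 16), lane 0 holds the minimum score and its id.
std::pair<float, int> warp_minimum(std::array<float, 32> v,
                                   std::array<int, 32> k) {
    for (int mask : {1, 2, 4, 8, 16}) {
        auto v_prev = v;  // values before this exchange step
        auto k_prev = k;
        for (int tgx = 0; tgx < 32; ++tgx) {
            int otgx = tgx ^ mask;  // partner lane for this step
            float v1 = v_prev[tgx], v2 = v_prev[otgx];
            int k1 = k_prev[tgx], k2 = k_prev[otgx];
            // Keep own (v1, k1) if flag is set, else take the partner's.
            bool flag = ((v1 < v2) != (tgx > otgx)) && (v1 != v2);
            v[tgx] = flag ? v1 : v2;
            k[tgx] = flag ? k1 : k2;
        }
    }
    return {v[0], k[0]};  // lane 0 now holds the warp minimum
}
```

Note the asymmetry in flag: the lower-indexed lane of each pair keeps the smaller value, so after each step the minimum of a progressively larger group of lanes accumulates toward lane 0.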

The Compatibility Tool could not completely migrate this code and included the following comment as the reason. It did replace the __shfl_sync calls with SYCL shuffle calls, but with incorrect arguments, as shown in the diff below of the manual change.

/*
DPCT1023:57: The DPC++ sub-group does not support mask options for shuffle.
*/

This comment indicates that the shuffle call in SYCL does not take a mask, as shown below.

#define WARPMINIMUMEXCHANGE(tgx, v0, k0, mask) \
        { \
                float v1 = v0; \
                int k1 = k0; \
                int otgx = tgx ^ mask; \
-               float v2 = item_ct1.get_sub_group().shuffle(energy, otgx); \
+               float v2 = item_ct1.get_sub_group().shuffle(v0, otgx); \
-               int k2 = item_ct1.get_sub_group().shuffle(bestID, otgx); \
+               int k2 = item_ct1.get_sub_group().shuffle(k0, otgx); \
                int flag = ((v1 < v2) ^ (tgx > otgx)) && (v1 != v2); \
                k0 = flag ? k1 : k2; \
                v0 = flag ? v1 : v2; \
        }

In this case, the Compatibility Tool performed an incorrect variable substitution in the shuffle calls, using the energy and bestID variables from the caller instead of the macro parameters. We fixed this manually by replacing energy with v0 and bestID with k0. This bug has been fixed in recent versions of the Compatibility Tool.

Summary

In summary, reduction operations in CUDA applications may not be migrated correctly by the Compatibility Tool. Review the comments the tool emits to understand whether manual migration is necessary and what change might be required. A good understanding of the original CUDA code will then help you make the manual changes needed to produce functionally correct SYCL code.

[1] https://www.intel.com/content/www/us/en/docs/dpcpp-compatibility-tool/get-started-guide/2023-1/overview.html

[2] https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/