This is my end-of-year post as we all make our way into the New Year.
We've done quite a lot of things this year to help make oneAPI easier to use - a lot of the blog posts I've written have been with an eye to educate.
We started off with some blog posts on how to use modern IDEs on Linux to write SYCL code and run it inside a container that we built, making it a turn-key effort to build applications.
There was the introduction to 'awesome oneAPI', which showed a set of curated links to oneAPI projects showcasing all the capabilities. We have been updating it regularly, so check it out! We are definitely looking for more AI-related projects - are you thinking of a project for next year? Need help? Let me know!
To complement awesome oneAPI, I have recently launched the oneAPI Web Showcase, where we hope to discover upcoming projects that people are working on and showcase them. There are also links on the website to help you start a project.
We now have community-focused documentation!!
I'll have another blog post up to show how you can help with the documentation by translating the documentation into different languages so that everybody can follow along in their native language. You can see the documentation at https://oneapi-community.github.io/documentation. We hope to grow it into a true community hub for oneAPI and SYCL. Exciting! All of these are community projects in themselves. Want to help out? Reach out or just submit a PR!
I want to keep building on the work we've done in 2023. So many opportunities!! 2023 was all about meeting developers where they were. Now, it's also time to meet them where they are AND also communicate with them in their own language!!
The key to adopting open platforms is to:
1) Have great documentation that's accessible in as many languages as possible.
2) Plenty of code samples that show how to do things.
3) Great developer experience - be able to set up your environment and just go!
4) Amazing community that interacts with each other, is active and works together.
These are all totally possible!! But oneAPI is relatively new and still largely under one vendor. With the formation of the UXL Foundation, we now have a neutral place for all vendors to congregate and work together. As a community, we should ask our hardware vendors to support Level Zero so that we get all the advantages of our hardware with a smooth experience.
So where do we want to go from here? Here are my personal goals/wish list for next year!
1) Reproducible builds - we should be able to continuously build and test oneAPI software.
2) More community assistance in documentation by helping translate the docs we have - as well as having more docs around tips and tricks.
3) Adding more projects to Awesome oneAPI and having more PRs from the community to add their projects! :)
With that, I wish all of you a wonderful holiday season and looking forward to great things in the oneAPI ecosystem in 2024!!
Photo by Jamie Street on Unsplash
I'm back!! A few raw posts have been languishing and I decided the end of the year was the perfect time to put them out there. This will be one of three (hopefully).
I'm going to focus on how to build oneAPI from git. This is somewhat of a return to my earlier blog post where I talked about how to build the DPC++ compiler; that build used the binary versions of the OpenCL and Level Zero runtimes.
That's all well and good, but let's consider how we could build the runtime from git completely. The ability to do reproducible builds is going to be important later when we dive into building packaging that is up to date and available on any Linux distro.
The build only supports Intel hardware at this point, since Level Zero doesn't support NVIDIA or AMD GPUs. If you are looking for such support, you might consider Codeplay's plugins, which allow you to use NVIDIA and AMD hardware.
These blog pages typically only focus on what we can do from an open source perspective and won't really focus on anything that has binary blobs if we can avoid it.
DISCLAIMER: Please don't use this set up for a production environment. It is not well tested. If you find any problems, please reach out in the comments so that I can help debug and update the blog post appropriately.
I like to use containers, which make it easy to quickly set up and automate everything; distrobox makes this straightforward.
You should be able to use whatever Linux distribution you want as long as you can install distrobox. You can, of course, use a virtual machine to accomplish this. I've used Vagrant successfully.
Assuming that you have distrobox installed - let's get to it.
Decide where you want to have the build, for instance ~/src/oneapi-build.
$ distrobox create --image docker.io/library/ubuntu:20.04 --name "oneAPIBuild"
$ distrobox enter oneAPIBuild
You should now be in a container running Ubuntu 20.04.
$ sudo apt-get install -y build-essential git libssl-dev flex bison libz-dev python3-mako python3-pip automake autoconf libtool pkg-config ruby
You will need a recent version of CMake for the builds; the one that comes with Ubuntu 20.04 is too old.
$ wget https://github.com/Kitware/CMake/releases/download/v3.28.1/cmake-3.28.1.tar.gz
$ tar xvfpz cmake-3.28.1.tar.gz
$ cd cmake-3.28.1
$ ./bootstrap
$ make
$ sudo make install
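To double-check that the freshly installed CMake in /usr/local is the one your shell picks up:
$ cmake --version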
Now you should have everything you need for the build.
There are a number of prerequisites before you start the build. Here is how the pieces of the oneAPI build fit together.
The order of build is:
1) The Intel Graphics Compiler (IGC) and its prerequisites, which consist of SPIRV-Headers, SPIRV-Tools, LLVM, opencl-clang, the SPIRV-LLVM Translator, and vc-intrinsics.
2) ocl-icd, the OpenCL loader.
3) GMMLib.
4) NEO, the Intel Compute Runtime.
5) Level Zero.
6) The DPC++ SYCL compiler.
7) oneTBB.
That's the progression to get the full build going.
Let's start by building NEO's first prerequisite, the Intel Graphics Compiler (IGC), along with IGC's own dependencies:
$ mkdir igc-workspace && cd igc-workspace
$ git clone https://github.com/KhronosGroup/SPIRV-Headers.git --depth 1
$ git clone https://github.com/KhronosGroup/SPIRV-Tools.git --depth 1
$ git clone -b llvmorg-14.0.5 https://github.com/llvm/llvm-project llvm-project --depth 1
$ git clone -b ocl-open-140 https://github.com/intel/opencl-clang llvm-project/llvm/projects/opencl-clang --depth 1
$ git clone -b llvm_release_140 https://github.com/KhronosGroup/SPIRV-LLVM-Translator llvm-project/llvm/projects/llvm-spirv --depth 1
$ git clone https://github.com/intel/vc-intrinsics --depth 1
$ git clone https://github.com/intel/intel-graphics-compiler igc --depth 1
$ mkdir build && cd build
$ cmake ../igc -DCMAKE_INSTALL_PREFIX="/usr/local"
$ make -j `nproc`
$ sudo make install
It should build cleanly. If it doesn't - please check any errors and make sure you have all the prerequisites.
ocl-icd is an OpenCL loader and is used to link OpenCL software when compiling. Make sure you are back in your usual oneapi-build directory.
$ pwd
~/src/oneapi-build
$ git clone https://github.com/OCL-dev/ocl-icd --depth 1
$ cd ocl-icd
$ ./bootstrap
$ ./configure
$ make
$ sudo make install
NEO requires GMMLib as one of its prerequisites so we will build that now.
$ pwd
~/src/oneapi-build
$ git clone https://github.com/intel/gmmlib --depth 1
$ cd gmmlib
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_INSTALL_PREFIX="/usr/local"
$ make
$ sudo make install
NEO is the Intel Compute Runtime and is necessary for SYCL-based applications to talk to the GPU. Go back to your ~/src/oneapi-build directory.
$ pwd
~/src/oneapi-build # please note this output will be different for you
$ mkdir neo-workspace
$ cd neo-workspace
$ git clone https://github.com/intel/compute-runtime neo --depth 1
$ cd neo
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_INSTALL_PREFIX="/usr/local"
$ make
$ sudo make install
This is Level Zero, the main interface of oneAPI to the hardware; it works with NEO or other runtimes. Since NEO is the only one at the moment, it will only work with Intel devices.
$ pwd
~/src/oneapi-build
$ git clone https://github.com/oneapi-src/level-zero --depth 1
$ cd level-zero
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_INSTALL_PREFIX="/usr/local"
$ make
$ sudo make install
Now to build the SYCL Compiler.
$ pwd
~/src/oneapi-build
$ mkdir sycl_workspace && cd sycl_workspace
$ export DPCPP_HOME=`pwd`
$ git clone https://github.com/intel/llvm.git -b sycl --depth 1
$ python3 $DPCPP_HOME/llvm/buildbot/configure.py --cmake-opt="-DCMAKE_INSTALL_PREFIX=/usr/local"
$ python3 $DPCPP_HOME/llvm/buildbot/compile.py
Finally, we need to install oneTBB.
$ pwd
~/src/oneapi-build
$ git clone https://github.com/oneapi-src/oneTBB --depth 1
$ cd oneTBB
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_INSTALL_PREFIX="/usr/local"
$ make
$ sudo make install
We need to make sure that the linker can find the proper libraries. The easiest way is to either set the LD_LIBRARY_PATH in your .bashrc or put it in /etc/environment.
$ export LD_LIBRARY_PATH="/usr/local/lib"
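One more note: compile.py builds the DPC++ toolchain under $DPCPP_HOME/llvm/build but does not install it into /usr/local, so the clang++ invocation below assumes the build's bin directory is on your PATH (the exact location is an assumption based on the default build layout):
$ export PATH="$DPCPP_HOME/llvm/build/bin:$PATH"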
$ cd ~/src
$ mkdir simple-oneapi-app
$ cd simple-oneapi-app
$ cat > simple-oneapi-app.cpp # paste the program below, then press Ctrl-D
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Creating buffer of 4 ints to be used inside the kernel code
  sycl::buffer<sycl::cl_int, 1> Buffer(4);
  // Creating SYCL queue
  sycl::queue Queue;
  // Size of index space for kernel
  sycl::range<1> NumOfWorkItems{Buffer.size()};
  // Submitting command group(work) to queue
  Queue.submit([&](sycl::handler &cgh) {
    // Getting write only access to the buffer on a device
    auto Accessor = Buffer.get_access<sycl::access::mode::write>(cgh);
    // Executing kernel
    cgh.parallel_for<class FillBuffer>(
        NumOfWorkItems, [=](sycl::id<1> WIid) {
          // Fill buffer with indexes
          Accessor[WIid] = (sycl::cl_int)WIid.get(0);
        });
  });
  // Getting read only access to the buffer on the host.
  // Implicit barrier waiting for queue to complete the work.
  const auto HostAccessor = Buffer.get_access<sycl::access::mode::read>();
  // Check the results
  bool MismatchFound = false;
  for (size_t I = 0; I < Buffer.size(); ++I) {
    if (HostAccessor[I] != I) {
      std::cout << "The result is incorrect for element: " << I
                << " , expected: " << I << " , got: " << HostAccessor[I]
                << std::endl;
      MismatchFound = true;
    }
  }
  if (!MismatchFound) {
    std::cout << "The results are correct!" << std::endl;
  }
  return MismatchFound;
}
$ clang++ -fsycl simple-oneapi-app.cpp -o simple-oneapi-app
When you run the app you should get "The results are correct!".
$ ./simple-oneapi-app
The results are correct!
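If the app reports no device instead, you can check which backends and devices the runtime detected with the sycl-ls utility that is built alongside the DPC++ compiler (assuming its bin directory is on your PATH as set up above):
$ sycl-ls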
Now you've successfully built oneAPI from source!
Let me know if you have any issues with the instructions in the comments.
Photo by Dominik Lückmann on Unsplash
This year we launched the first HPC Social Noodles Award, a celebration of our frustrations and comical takes on the events of the year. The top noodle was, of course, the whole CentOS debacle, followed by a few gripes about software and vendors, with the funnier noodles starting at item 7.
HPC Social was present (and giving out stickers) at the bash this year! The branding was… excellent.
This was a parody music video made by community leader @vsoch to celebrate a generic HPC technology (MPI) in the high performance community!
She made an effort to engage others to participate, and was moderately successful in getting a few shared pictures. It would be a fun idea if others participated to a greater extent at a future Supercomputing!
The “official” greeting for SC23 was tapping someone on the shoulder, as announced by HPC Guru.
Our very own Alan Sill hosted a booth to show off a tiny cluster! While Raspberry Pi clusters have been around for a long time and are useful for hobbyist activities, training, and home automation, this was one of the first such small clusters running Fedora 39 as a natively installed OS on the head node and Enterprise Linux (in this case Rocky) on the worker nodes. More to come as other popular mainline cluster tools like Warewulf, Spack and/or EasyBuild, and the Slurm and/or Flux schedulers are added.
And finally, we close with a few shots shared in the HPC Social slack! We love our community! ❤️
Felix (finally) got his “I am HPC Guru” pin!
For a long time, incubation of the oneAPI ecosystem happened at Intel, through both the DPC++ SYCL compiler and the publishing of the open oneAPI specifications and their open source implementations. While the messaging around the specs was that they were open to all contributors, it could be hard, especially for competitors, to feel comfortable entering a space that was not perceived as truly neutral.
With the oneAPI specs now under the aegis of the Linux Foundation, the specs and their open source implementations will be well managed under established norms. There is now a true center of gravity for working together as equal partners on the oneAPI spec and its open source implementations.
We can now focus on driving a true industry standard for heterogeneous computing under the UXL Foundation, collaborate seriously, and finally use all of our hardware. This will herald a sustainable ecosystem that we can all be proud of.
There are still challenges ahead. How our toolchains work together under the UXL Foundation is going to be important, and we can address these concerns by participating vigorously in the UXL Foundation. Creating an open ecosystem is always challenging, as many partners need to agree and align on goals and processes. Listening to each other and to the community is going to be key.
With all that being said - I hope that you will take the time to look at what has been established so far. We have a humble beginning but with your help and participation we can take it to the next level. For further reading, please see https://uxlfoundation.org/. Looking forward to seeing you all there.
Cover image: Photo by Scott Blake on Unsplash
The Compatibility Tool adds comments in the code where manual migration may be required. Typically, the manual changes fall into two categories: changes required for the code to compile and be functionally correct, and changes necessary to get better performance. Here, I will cover code that uses the 'reduction' operation. Reductions are frequently used in High Performance Computing and scientific applications and can be performance hotspots. The first example finds the sum of integers; the second finds the minimum of floats along with the identifier of the run that corresponds to that minimum.
The docking application performs integer reductions to keep a running count of the number of score evaluations. This reduction is implemented as a multi-line macro in CUDA as shown below.
#define REDUCEINTEGERSUM(value, pAccumulator) \
    if (threadIdx.x == 0) \
    { \
        *pAccumulator = 0; \
    } \
    __threadfence(); \
    __syncthreads(); \
    if (__any_sync(0xffffffff, value != 0)) \
    { \
        uint32_t tgx = threadIdx.x & cData.warpmask; \
        value += __shfl_sync(0xffffffff, value, tgx ^ 1); \
        value += __shfl_sync(0xffffffff, value, tgx ^ 2); \
        value += __shfl_sync(0xffffffff, value, tgx ^ 4); \
        value += __shfl_sync(0xffffffff, value, tgx ^ 8); \
        value += __shfl_sync(0xffffffff, value, tgx ^ 16); \
        if (tgx == 0) \
        { \
            atomicAdd(pAccumulator, value); \
        } \
    } \
    __threadfence(); \
    __syncthreads(); \
    value = *pAccumulator; \
    __syncthreads();
Let us review what this code is doing:
1) Thread 0 of the block zeroes the accumulator, and the __threadfence()/__syncthreads() pair makes that store visible to every thread before the reduction starts.
2) If any lane in the warp has a non-zero value (__any_sync), the lanes perform a butterfly reduction with __shfl_sync over strides 1, 2, 4, 8, and 16, after which every lane holds the warp's partial sum.
3) Lane 0 of each warp then adds its partial sum into the accumulator with atomicAdd.
4) After another fence and barrier, every thread reads the final sum back from the accumulator.
For more details about these CUDA calls, please refer to [2].
The Compatibility Tool was not able to automatically migrate this code and emitted the following comments.
/*
DPCT1023:40: The DPC++ sub-group does not support mask options for sycl::ext::oneapi::any_of.
DPCT1023:41: The DPC++ sub-group does not support mask options for shuffle.
DPCT1007:39: Migration of this CUDA API is not supported by the Intel(R) DPC++ Compatibility Tool.
*/
However, SYCL supports a rich set of functions for performing reductions. In this case, the reduce_over_group() function in SYCL can be used to provide the same functionality as the code above, as follows.
#define REDUCEINTEGERSUM(value, pAccumulator) \
    int val = sycl::reduce_over_group(item_ct1.get_group(), value, std::plus<>()); \
    *pAccumulator = val; \
    item_ct1.barrier(sycl::access::fence_space::local_space);
sycl::reduce_over_group is a collective function, and using it greatly simplifies the macro. The function takes the group, the value to be reduced, and the reduction operation, which in this case is plus (summation). It adapts to varied work-group sizes in SYCL and will use the best optimizations available in the compiler and runtime.
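To make the collective concrete, here is a minimal, self-contained sketch - not taken from the docking application, with illustrative queue setup and variable names - that sums one value per work-item across a single work-group of 64 using unified shared memory:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    constexpr size_t N = 64; // one work-group of 64 work-items
    sycl::queue Queue;
    int *Data = sycl::malloc_shared<int>(N, Queue);
    int *Sum = sycl::malloc_shared<int>(1, Queue);
    for (size_t I = 0; I < N; ++I) Data[I] = 1;
    *Sum = 0;
    Queue.parallel_for(sycl::nd_range<1>{N, N}, [=](sycl::nd_item<1> Item) {
        // Every work-item contributes one value; the collective returns
        // the group-wide sum to all work-items in the group.
        int Val = sycl::reduce_over_group(Item.get_group(),
                                          Data[Item.get_global_id(0)],
                                          sycl::plus<>());
        // With a single work-group, a plain store from one work-item suffices.
        if (Item.get_local_id(0) == 0)
            *Sum = Val;
    }).wait();
    std::cout << "sum = " << *Sum << std::endl; // expect 64
    sycl::free(Data, Queue);
    sycl::free(Sum, Queue);
    return 0;
}

Here sycl::plus<>() is the SYCL-provided binary operation; the migrated macro above uses std::plus<>(), which DPC++ also accepts.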
In another part of the application, a block of CUDA threads performs shuffles to find the minimum of the scores v0 and the corresponding identifier k0 of the run in the simulation that produced the minimum score. The CUDA code calls a macro WARPMINIMUM2 (not shown here; a sketch of it appears below) which in turn calls another macro WARPMINIMUMEXCHANGE (shown) with mask set to 1, 2, 4, 8, and 16.
#define WARPMINIMUMEXCHANGE(tgx, v0, k0, mask) \
    { \
        float v1 = v0; \
        int k1 = k0; \
        int otgx = tgx ^ mask; \
        float v2 = __shfl_sync(0xffffffff, v0, otgx); \
        int k2 = __shfl_sync(0xffffffff, k0, otgx); \
        int flag = ((v1 < v2) ^ (tgx > otgx)) && (v1 != v2); \
        k0 = flag ? k1 : k2; \
        v0 = flag ? v1 : v2; \
    }
__shfl_sync provides a way of moving a value from one thread to other threads in the warp in one instruction. In this code snippet, __shfl_sync fetches the v0 or k0 value from the lane whose index is otgx (the caller's lane index XORed with mask) and saves it in the v2 and k2 variables. We then compare v1 with v2 to set flag, and eventually store the minimum in v0 and the run identifier for that minimum in k0.
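WARPMINIMUM2 itself is not shown in this post, but from the description it is presumably a butterfly reduction that chains WARPMINIMUMEXCHANGE over the five strides. A sketch of what it likely looks like (a reconstruction from the description, not the application's verbatim code):

#define WARPMINIMUM2(tgx, v0, k0) \
    WARPMINIMUMEXCHANGE(tgx, v0, k0, 1) \
    WARPMINIMUMEXCHANGE(tgx, v0, k0, 2) \
    WARPMINIMUMEXCHANGE(tgx, v0, k0, 4) \
    WARPMINIMUMEXCHANGE(tgx, v0, k0, 8) \
    WARPMINIMUMEXCHANGE(tgx, v0, k0, 16)

After these five exchanges, every lane of the 32-wide warp holds the minimum score and its run identifier.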
The Compatibility Tool could not completely migrate this code and included the comment below as the reason. It did correctly replace the __shfl_sync calls with SYCL shuffle calls, as shown in the diff below, which also captures the manual change that was still required.
/*
DPCT1023:57: The DPC++ sub-group does not support mask options for shuffle.
*/
This comment indicates that the shuffle call in SYCL does not use a mask as shown below.
#define WARPMINIMUMEXCHANGE(tgx, v0, k0, mask)
{
    float v1 = v0;
    int k1 = k0;
    int otgx = tgx ^ mask;
-   float v2 = item_ct1.get_sub_group().shuffle(energy, otgx);
+   float v2 = item_ct1.get_sub_group().shuffle(v0, otgx);
-   int k2 = item_ct1.get_sub_group().shuffle(bestID, otgx);
+   int k2 = item_ct1.get_sub_group().shuffle(k0, otgx);
    int flag = ((v1 < v2) ^ (tgx > otgx)) && (v1 != v2);
    k0 = flag ? k1 : k2;
    v0 = flag ? v1 : v2;
}
In this case, Compatibility Tool performed incorrect variable substitution for v0 and k0 in the shuffle calls using energy and bestID variables from the caller function. We manually fixed this by replacing energy with v0 and bestID with k0. This bug has been fixed in recent versions of the Compatibility Tool.
In summary, reduction operations in CUDA applications may not be migrated correctly by the Compatibility Tool. Review the comments provided by the tool to understand if manual migration is necessary and what change might be required. A good understanding of the original CUDA code will then help to make manual changes to develop functionally correct code in SYCL.
[2] https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/