MNT Reform 2 - part deux
A few days back I posted some initial thoughts on the MNT Reform 2 laptop, which recently arrived. I ran the usual battery of tests on the laptop including, of course, the High Performance Linpack (HPL) benchmark just for kicks.
At that time, I made no attempt to optimize HPL; I simply went with the OS-supplied gcc and math libraries. My next step was to see how much I could improve my HPL result using the Arm compiler for Linux and the Arm performance libraries. Here I'll walk through those steps, from installing the Arm tools to compiling and running HPL, and all of the small details in between.
(1) To start, I downloaded the latest version of the Arm compiler for Linux package from here. This was the package with the filename: arm-compiler-for-linux_22.0.2_Ubuntu-20.04_aarch64.tar.
(2) After uncompressing arm-compiler-for-linux_22.0.2_Ubuntu-20.04_aarch64.tar, I ran the installation command ./arm-compiler-for-linux_22.0.2_Ubuntu-20.04.sh -a, which installed the software to /opt/arm on the system. Note that the Arm compiler for Linux ships with module files to make setting up the compilation environment easy. To support this, I had to install the OS environment-modules package with apt-get install environment-modules.
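For reference, the whole install boils down to a few commands. This is a sketch based on the steps above; the installer script may be extracted into a subdirectory, so adjust the path accordingly.
tar -xf arm-compiler-for-linux_22.0.2_Ubuntu-20.04_aarch64.tar
./arm-compiler-for-linux_22.0.2_Ubuntu-20.04.sh -a     # installs to /opt/arm, as described above
apt-get install environment-modules                    # provides the module command used in the next step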
(3) In order to load the module for the Arm compiler for Linux, the following steps are necessary. This assumes that the Arm compiler for Linux is installed in /opt/arm.
root@reform:/# module avail
----------------------------------- /usr/share/modules/modulefiles ------------------------------------
dot module-git module-info modules null use.own
Key:
modulepath
root@reform:/# export MODULEPATH=/opt/arm/modulefiles:$MODULEPATH
root@reform:/# module avail
---------------------------------------- /opt/arm/modulefiles -----------------------------------------
acfl/22.0.2 binutils/11.2.0 gnu/11.2.0
----------------------------------- /usr/share/modules/modulefiles ------------------------------------
dot module-git module-info modules null use.own
Key:
modulepath
root@reform:/# module load acfl/22.0.2
Loading acfl/22.0.2
Loading requirement: binutils/11.2.0
root@reform:/# echo $PATH
/opt/arm/arm-linux-compiler-22.0.2_Generic-AArch64_Ubuntu-20.04_aarch64-linux/bin:/opt/arm/gcc-11.2.0_Generic-AArch64_Ubuntu-20.04_aarch64-linux/binutils_bin:/root/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
root@reform:/# armclang --version
Arm C/C++/Fortran Compiler version 22.0.2 (build number 1776) (based on LLVM 13.0.0)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/arm/arm-linux-compiler-22.0.2_Generic-AArch64_Ubuntu-20.04_aarch64-linux/bin
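The MODULEPATH export above only lasts for the current shell. To avoid repeating it, one option is to persist it in a shell startup file, or to use module use for the session; a minimal sketch, assuming bash:
echo 'export MODULEPATH=/opt/arm/modulefiles:$MODULEPATH' >> ~/.bashrc
# or, equivalently for the current session only:
module use /opt/arm/modulefiles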
(4) Now we shift our focus to Open MPI. Open MPI is an open-source implementation of the Message Passing Interface (MPI) standard for writing parallel applications, and we will compile HPL against it. For this, I downloaded the latest Open MPI version (4.1.4) from here.
By default, Open MPI compiles with support for the SLURM workload manager. My Reform has IBM Spectrum LSF installed as the workload scheduler. In order to enable LSF support in Open MPI, we need to specify the appropriate configure flags (see below).
root@reform:/opt/HPC/openmpi-4.1.4# ./configure --prefix=/opt/HPC/openmpi-4.1.4 --with-lsf=/opt/ibm/lsf/10.1 --with-lsf-libdir=/opt/ibm/lsf/10.1/linux3.12-glibc2.17-armv8/lib
root@reform:/opt/HPC/openmpi-4.1.4# make -j 4
...
...
root@reform:/opt/HPC/openmpi-4.1.4# make install
...
...
(5) After compiling Open MPI, run the ompi_info command to check whether support for LSF has been enabled. Note that you must source the LSF environment (i.e. . ./profile.lsf) before running ompi_info, or the LSF libraries won't be found.
root@reform:/opt/HPC/openmpi-4.1.4# ./bin/ompi_info |grep -i lsf
Configure command line: '--prefix=/opt/HPC/openmpi-4.1.4' '--with-lsf=/opt/ibm/lsf/10.1' '--with-lsf-libdir=/opt/ibm/lsf/10.1/linux3.12-glibc2.17-armv8/lib'
MCA ess: lsf (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA plm: lsf (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA ras: lsf (MCA v2.1.0, API v2.0.0, Component v4.1.4)
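To make sure this Open MPI build (and not an OS-packaged one) is picked up when compiling and running HPL, the environment can be set up along the following lines. The profile.lsf location is an assumption based on the standard LSF layout under the install prefix; adjust to your installation.
. /opt/ibm/lsf/10.1/conf/profile.lsf                      # LSF environment (assumed conf location)
export PATH=/opt/HPC/openmpi-4.1.4/bin:$PATH              # this build's mpirun and mpicc
export LD_LIBRARY_PATH=/opt/HPC/openmpi-4.1.4/lib:$LD_LIBRARY_PATH
module load acfl/22.0.2                                   # Arm compiler and libraries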
(6) Next, I downloaded the latest HPL package from here and uncompressed hpl-2.3.tar.gz in the /opt/HPC directory. I then had to create a new Makefile for HPL that uses the Arm compiler for Linux and the optimized math libraries. A sketch of the extraction and template setup is shown next, followed by a copy of Make.imx8qm.
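The following is a sketch of those preparation steps. The template name is an assumption: HPL ships a number of Make.<arch> examples under its setup/ directory, and any of them can serve as the starting point for the Make.imx8qm shown below.
cd /opt/HPC
tar -xzf hpl-2.3.tar.gz
cd hpl-2.3
cp setup/Make.Linux_PII_CBLAS Make.imx8qm     # copy a bundled template (assumed name) and edit it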
#
# -- High Performance Computing Linpack Benchmark (HPL)
# HPL - 2.3 - December 2, 2018
# Antoine P. Petitet
# University of Tennessee, Knoxville
# Innovative Computing Laboratory
# (C) Copyright 2000-2008 All Rights Reserved
#
# -- Copyright notice and Licensing terms:
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions, and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
#
# 3. All advertising materials mentioning features or use of this
# software must display the following acknowledgement:
# This product includes software developed at the University of
# Tennessee, Knoxville, Innovative Computing Laboratory.
#
# 4. The name of the University, the name of the Laboratory, or the
# names of its contributors may not be used to endorse or promote
# products derived from this software without specific written
# permission.
#
# -- Disclaimer:
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY
# OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# ######################################################################
#
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL = /bin/sh
#
CD = cd
CP = cp
LN_S = ln -s
MKDIR = mkdir
RM = /bin/rm -f
TOUCH = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH = imx8qm
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir = /opt/HPC/hpl-2.3
INCdir = /opt/HPC/hpl-2.3/include
BINdir = /opt/HPC/hpl-2.3/bin/$(ARCH)
LIBdir = /opt/HPC/hpl-2.3/lib/$(ARCH)
#
HPLlib = /opt/HPC/hpl-2.3/lib/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the C compiler where to find the Message Passing library
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir = /opt/HPC/openmpi-4.1.4
MPinc = /opt/HPC/openmpi-4.1.4/include
MPlib = /opt/HPC/openmpi-4.1.4/lib/libmpi.so
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the C compiler where to find the Linear Algebra library
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir =
LAinc =
# LAlib = -lamath -lm -mcpu=native
LAlib =
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section if and only if you are not planning to use
# a BLAS library featuring a Fortran 77 interface. Otherwise, it is
# necessary to fill out the F2CDEFS variable with the appropriate
# options. **One and only one** option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_ : all lower case and a suffixed underscore (Suns,
# Intel, ...), [default]
# -DNoChange : all lower case (IBM RS6000),
# -DUpCase : all upper case (Cray),
# -DAdd__ : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int : Fortran 77 INTEGER is a C int, [default]
# -DF77_INTEGER=long : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle : The string address is passed at the string loca-
# tion on the stack, and the string length is then
# passed as an F77_INTEGER after all explicit
# stack arguments, [default]
# -DStringStructPtr : The address of a structure is passed by a
# Fortran 77 string, and the structure is of the
# form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal : A structure is passed by value for each Fortran
# 77 string, and the structure is of the form:
# struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle : Special option for Cray machines, which uses
# Cray fcd (fortran character descriptor) for
# interoperation.
#
F2CDEFS =
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) -I$(MPinc) -I/opt/arm/armpl-22.0.2_AArch64_Ubuntu-20.04_gcc_aarch64-linux/include/
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS call the cblas interface;
# -DHPL_CALL_VSIPL call the vsip library;
# -DHPL_DETAILED_TIMING enable detailed timers;
#
# By default HPL will:
# *) not copy L before broadcast,
# *) call the BLAS Fortran 77 interface,
# *) not display detailed timing information.
#
HPL_OPTS =
#
# ----------------------------------------------------------------------
#
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC = armclang
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -O3 -larmpl_lp64 -lamath -lm
#
LINKER = armclang -O3 -armpl -lamath -lm
LINKFLAGS = $(CCFLAGS)
#
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
#
# ----------------------------------------------------------------------
(7) Compiling HPL with the above Makefile is as simple as running make and specifying the architecture imx8qm.
root@reform:/opt/HPC/hpl-2.3# make arch=imx8qm
...
...
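If the Makefile needs further tweaking, the architecture-specific build artifacts can be cleaned before rebuilding. The clean_arch_all target below is taken from the stock HPL top-level Makefile, so treat it as an assumption if your copy differs.
make arch=imx8qm clean_arch_all     # remove objects, libraries and binaries for this arch
make arch=imx8qm                    # rebuild from scratch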
(8) Barring any errors, we should now have an xhpl binary under the /opt/HPC/hpl-2.3/bin/imx8qm directory.
root@reform:/opt/HPC/hpl-2.3/bin/imx8qm# pwd
/opt/HPC/hpl-2.3/bin/imx8qm
root@reform:/opt/HPC/hpl-2.3/bin/imx8qm# ls -la
total 156
drwxr-xr-x 2 root root 4096 Jun 8 13:30 .
drwxr-xr-x 3 root root 4096 Jun 8 13:20 ..
-rw-r--r-- 1 root root 1454 Jun 8 13:30 HPL.dat
-rwxr-xr-x 1 root root 146960 Jun 8 13:24 xhpl
root@reform:/opt/HPC/hpl-2.3/bin/imx8qm# ldd ./xhpl
linux-vdso.so.1 (0x0000007faa7b1000)
libamath_aarch64.so => /opt/arm/arm-linux-compiler-22.0.2_Generic-AArch64_Ubuntu-20.04_aarch64-linux/llvm-bin/../lib/libamath_aarch64.so (0x0000007faa5ef000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007faa520000)
libarmpl_lp64.so => /opt/arm/arm-linux-compiler-22.0.2_Generic-AArch64_Ubuntu-20.04_aarch64-linux/lib/clang/13.0.0/armpl_links/lib/libarmpl_lp64.so (0x0000007fa3cd5000)
libmpi.so.40 => /usr/lib/aarch64-linux-gnu/libmpi.so.40 (0x0000007fa3b8f000)
libarmflang.so => /opt/arm/arm-linux-compiler-22.0.2_Generic-AArch64_Ubuntu-20.04_aarch64-linux/llvm-bin/../lib/libarmflang.so (0x0000007fa3728000)
libomp.so => /opt/arm/arm-linux-compiler-22.0.2_Generic-AArch64_Ubuntu-20.04_aarch64-linux/llvm-bin/../lib/libomp.so (0x0000007fa3649000)
librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000007fa3631000)
libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000007fa361d000)
libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007fa35ed000)
libastring_aarch64.so => /opt/arm/arm-linux-compiler-22.0.2_Generic-AArch64_Ubuntu-20.04_aarch64-linux/llvm-bin/../lib/libastring_aarch64.so (0x0000007fa35da000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007fa345f000)
/lib/ld-linux-aarch64.so.1 (0x0000007faa77e000)
libgcc_s.so.1 => /opt/arm/gcc-11.2.0_Generic-AArch64_Ubuntu-20.04_aarch64-linux/lib64/libgcc_s.so.1 (0x0000007fa343a000)
libopen-rte.so.40 => /usr/lib/aarch64-linux-gnu/libopen-rte.so.40 (0x0000007fa336c000)
libopen-pal.so.40 => /usr/lib/aarch64-linux-gnu/libopen-pal.so.40 (0x0000007fa32aa000)
libhwloc.so.15 => /usr/lib/aarch64-linux-gnu/libhwloc.so.15 (0x0000007fa3245000)
libstdc++.so.6 => /opt/arm/gcc-11.2.0_Generic-AArch64_Ubuntu-20.04_aarch64-linux/lib64/libstdc++.so.6 (0x0000007fa3030000)
libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1 (0x0000007fa3006000)
libevent_core-2.1.so.7 => /usr/lib/aarch64-linux-gnu/libevent_core-2.1.so.7 (0x0000007fa2fbf000)
libutil.so.1 => /lib/aarch64-linux-gnu/libutil.so.1 (0x0000007fa2fab000)
libevent_pthreads-2.1.so.7 => /usr/lib/aarch64-linux-gnu/libevent_pthreads-2.1.so.7 (0x0000007fa2f98000)
libudev.so.1 => /usr/lib/aarch64-linux-gnu/libudev.so.1 (0x0000007fa2f5e000)
(9) A default HPL.dat file should be present in the /opt/HPC/hpl-2.3/bin/imx8qm directory. HPL.dat is used to tune the benchmark problem size and process grid to the system. A copy of the HPL.dat file I created follows below; it is sized for the 4 GB memory configuration of the Reform and its 4 processor cores (see the sizing note after the file).
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
19000 Ns
1 # of NBs
192 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
2 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0 Number of additional problem sizes for PTRANS
1200 10000 30000 values of N
0 number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64 values of NB
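For reference, a common rule of thumb for choosing N is to size the N x N double-precision matrix to a large fraction of RAM while leaving headroom for the OS and MPI, i.e. N is roughly sqrt(fraction * memory_in_bytes / 8). A quick check for 4 GB at an illustrative 70% fill:
python3 -c "print(int((0.70 * 4 * 1024**3 / 8) ** 0.5))"     # prints 19385
That lands close to the N of 19000 used above. The 2 x 2 process grid simply maps the 4 cores onto a roughly square P x Q layout, which is the usual HPL recommendation.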
(10) Now we're ready to run the xhpl executable with the appropriate mpirun command. We specify -np 4 to run across the 4 cores of the processor. With this better-optimized build we're seeing ~8.9 GFLOPS, compared with ~4 GFLOPS for my previous runs where HPL was compiled with the OS-supplied GCC and math libraries (ATLAS). Since this is roughly double the previous result, it appears that the non-optimized build had an issue with double precision, or perhaps with vectorization.
gsamu@reform:/opt/HPC/hpl-2.3/bin/imx8qm$ mpirun -np 4 ./xhpl
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 19000
NB : 192
PMAP : Row-major process mapping
P : 2
Q : 2
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 19000 192 2 2 513.92 8.8987e+00
HPL_pdgesv() start time Wed Jun 8 21:28:07 2022
HPL_pdgesv() end time Wed Jun 8 21:36:41 2022
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 4.89711678e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
(11) Finally, we submit the same Linpack run through Spectrum LSF. The LSF bsub command invocation and the resulting output are shown below.
gsamu@reform:~$ bsub -n 4 -I -m reform "cd /opt/HPC/hpl-2.3/bin/imx8qm ; mpirun ./xhpl"
Job <35301> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on reform>>
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 19000
NB : 192
PMAP : Row-major process mapping
P : 2
Q : 2
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 19000 192 2 2 518.02 8.8283e+00
HPL_pdgesv() start time Thu Jun 9 09:33:35 2022
HPL_pdgesv() end time Thu Jun 9 09:42:13 2022
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 4.89711678e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================