This is a crosspost from Blogs on Technical Computing Goulash. See the original post here.

Pi in the sky? A compute cluster in mini ITX form factor

Overview

It’s taken me a while to get the wheels off the ground in 2024 in terms of blogging. This blog idea has actually been in the works for some time. Back in 2021, I wrote a blog titled Late to the party and a few bits short. This was a tongue-in-cheek title for a blog on the Novena Desktop System, which is based on a 32-bit processor, hence a few bits short. And late to the party referred to the fact that I was very late to purchase a second-hand Novena system.

This blog is similar in that it’s about the original Turing Pi V1 system, which was released back in 2021 when the Turing Pi V2 launch was imminent. The Turing Pi V1 is a 7 node cluster in a mini-ITX form factor. It’s based on the Raspberry Pi CM3(+) modules. This was really an impulse purchase during the dark days of COVID. And as I found out, getting a hold of RPi CM3s was much harder than expected. As luck would have it, I eventually found a source via an online marketplace here in Southern Ontario that was not charging an arm and a leg for them. I purchased a total of 7 CM3+ modules with no onboard storage and relied upon SD cards for storage. As (bad) luck would have it, I ended up having to purchase a CM3 with onboard storage because one of the SD card slots on the board is defective; the spring mechanism doesn’t work properly. And as we’ll see later on, this also had an unusual side effect when running Linpack.

I’ve had the fully populated system for about 6 months now. And although the Turing Pi V1 is old news at this stage, I still wanted to write a bit about my experience with it. And of course, because it’s a cluster, I definitely wanted to put it through its paces running Linpack.

The official Turing Pi V1 documentation was my go-to for the system setup. The cluster was installed with the latest (at the time) Raspberry Pi OS (2023-02-21-raspios-bullseye-arm64-lite.img) based on Debian 11 (Bullseye).

A number of additional software packages were installed or compiled, including LSF, Open MPI and HPL. Note that the head node of the cluster acts as an NFS server for the remaining cluster nodes (/opt).
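
The NFS export itself isn’t described further in the post. A minimal sketch of what it might look like, assuming the head node exports /opt read-write to the worker subnet (the subnet address below is a placeholder, not taken from the original setup):

# /etc/exports on the head node (turingpi) -- subnet is a placeholder
/opt 192.168.1.0/24(rw,sync,no_subtree_check)

# Apply the export without restarting the NFS server
exportfs -ra

# /etc/fstab entry on each worker node to mount the shared /opt
turingpi:/opt  /opt  nfs  defaults  0  0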

Here is the output of the LSF lshosts command. We see 6 CM3+ systems detected, and one CM3. Note that this required additional LSF configuration.

lsfadmin@turingpi:/opt/HPC/hpl-2.3 $ lshosts -w
HOST_NAME                       type       model  cpuf ncpus maxmem maxswp server RESOURCES
turingpi                  LINUX_ARM64     CM3plus   6.0     4   910M   100M    Yes (mg)
neumann                   LINUX_ARM64     CM3plus   6.0     4   910M   100M    Yes ()
teller                    LINUX_ARM64     CM3plus   6.0     4   910M   100M    Yes ()
szilard                   LINUX_ARM64     CM3plus   6.0     4   910M   100M    Yes ()
wigner                    LINUX_ARM64     CM3plus   6.0     4   910M   100M    Yes ()
kemeny                    LINUX_ARM64         CM3   5.0     4   910M   100M    Yes ()
vonkarman                 LINUX_ARM64     CM3plus   6.0     4   910M   100M    Yes ()
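
The additional LSF configuration mentioned above isn’t reproduced in the post. A plausible sketch, assuming custom host models were added to lsf.shared (with hosts then mapped to those models in the lsf.cluster file), so that lshosts reports CM3 and CM3plus with distinct CPU factors:

# lsf.shared -- hypothetical excerpt; the model names and CPU factors
# simply mirror what lshosts reports above
Begin HostModel
MODELNAME   CPUFACTOR   ARCHITECTURE   # keyword
CM3plus     6.0         ()
CM3         5.0         ()
End HostModel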

Those with a keen eye will note that the majority of the cluster nodes are named after Hungarian scientists: neumann (John von Neumann), teller (Edward Teller), szilard (Leo Szilárd), wigner (Eugene Wigner), kemeny (John Kemeny) and vonkarman (Theodore von Kármán).

The odd one out here is of course turingpi, which is the name of the head node of the cluster and is named after Alan Turing. But I digress.

For completeness, HPL V2.3 was compiled using the Arm Optimizing Compilers.
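
The exact compiler flags aren’t reproduced here. As a purely hypothetical sketch of what the HPL Make.cm3_3 file might contain (the arch name cm3_3 comes from the working directory shown in the listings below; the flags and library choices are my assumptions, not the author’s actual settings):

# Make.cm3_3 -- hypothetical excerpt, not the author's actual build file
ARCH     = cm3_3
TOPdir   = /opt/HPC/hpl-2.3
# MPI compiler wrapper (assumed to invoke the Arm compiler underneath)
CC       = mpicc
LINKER   = mpicc
# Assumed tuning for the Cortex-A53 cores on the CM3/CM3+
CCFLAGS  = $(HPL_DEFS) -Ofast -mcpu=cortex-a53
# Assumed BLAS: Arm Performance Libraries
LAlib    = -larmpl_lp64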

For the first HPL run, we submit the job requesting a total of 24 cores. There are a total of 28 cores in the cluster, but we’ve isolated the head node of the cluster as it’s the NFS server for the environment. We see that the head node turingpi shows a closed status here, meaning that it won’t accept any jobs from LSF.

lsfadmin@turingpi:/opt/HPC/hpl-2.3/bin/cm3_3 $ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV 
kemeny             ok              -      4      0      0      0      0      0
neumann            ok              -      4      0      0      0      0      0
szilard            ok              -      4      0      0      0      0      0
teller             ok              -      4      0      0      0      0      0
turingpi           closed          -      4      0      0      0      0      0
vonkarman          ok              -      4      0      0      0      0      0
wigner             ok              -      4      0      0      0      0      0
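
The closed state on turingpi is presumably set administratively (the post doesn’t show how); in LSF this is typically done with badmin, along the lines of:

# Close the head node so LSF will not dispatch jobs to it
badmin hclose turingpi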

Turing up the heat - literally

Submit HPL using the LSF bsub command, requesting 24 cores in the cluster with core affinity specified.

lsfadmin@turingpi:/opt/HPC/hpl-2.3/bin/cm3_3 $ bsub -n 24 -R "affinity[core(1)]" -Is mpirun --mca btl_tcp_if_exclude lo,docker0 ./xhpl
Job <41861> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on neumann>>
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   15968 
NB     :      48       96      192 
PMAP   : Row-major process mapping
P      :       4        6 
Q      :       6        4 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: teller
  Local PID:  2253
  Peer host:  kemeny
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 23 with PID 2448 on node kemeny exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------

We see above that the MPI rank(s) fail on host kemeny, which happens to be the CM3 module (not CM3+). Even though I compiled HPL natively on kemeny, the issue persisted. So ultimately, the HPL run was limited to the 5 remaining CM3+ nodes (i.e. 20 cores).
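
As a quick sanity check on the problem size (my own back-of-the-envelope arithmetic, not from the original post): the N = 15968 double-precision matrix occupies roughly N² × 8 bytes, about 2 GB in total, or roughly 408 MB per node when spread across the 5 CM3+ nodes. That sits comfortably under the ~910 MB that lshosts reports per module.

# Approximate HPL matrix footprint for N=15968 (8 bytes per double)
echo "15968 * 15968 * 8 / 1000000" | bc      # ~2039 MB in total
echo "15968 * 15968 * 8 / 1000000 / 5" | bc  # ~407 MB per node over 5 nodes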

Next, we submit HPL requesting 20 cores (all on CM3+ modules). Core affinity is specified, and we request specifically the model type “CM3plus”. The job was submitted interactively and the output follows:

lsfadmin@turingpi:/opt/HPC/hpl-2.3/bin/cm3_3 $ bsub -n 20 -Is -R "select[model==CM3plus] affinity[core(1)]" mpirun --mca btl_tcp_if_exclude lo,docker0 ./xhpl
Job <41865> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on vonkarman>>
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   15968 
NB     :      48       96      192 
PMAP   : Row-major process mapping
P      :       4        5 
Q      :       5        4 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       15968    48     4     5             327.96             8.2776e+00
HPL_pdgesv() start time Sun Mar  3 20:29:45 2024

HPL_pdgesv() end time   Sun Mar  3 20:35:13 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.74851526e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       15968    96     4     5             315.71             8.5987e+00
HPL_pdgesv() start time Sun Mar  3 20:35:18 2024

HPL_pdgesv() end time   Sun Mar  3 20:40:34 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.82600703e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       15968   192     4     5             319.93             8.4854e+00
HPL_pdgesv() start time Sun Mar  3 20:40:38 2024

HPL_pdgesv() end time   Sun Mar  3 20:45:58 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.56990081e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       15968    48     5     4             342.36             7.9293e+00
HPL_pdgesv() start time Sun Mar  3 20:46:03 2024

HPL_pdgesv() end time   Sun Mar  3 20:51:45 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.89956630e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       15968    96     5     4             313.72             8.6531e+00
HPL_pdgesv() start time Sun Mar  3 20:51:50 2024

HPL_pdgesv() end time   Sun Mar  3 20:57:04 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.04113830e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       15968   192     5     4             312.48             8.6877e+00
HPL_pdgesv() start time Sun Mar  3 20:57:08 2024

HPL_pdgesv() end time   Sun Mar  3 21:02:21 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.30812017e-03 ...... PASSED
================================================================================

Finished      6 tests with the following results:
              6 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

We observed during the HPL run that the CPU temperatures exceeded 80 degrees Celsius:

root@turingpi:/home/lsfadmin# parallel-ssh -h /opt/workers -i "/opt/tools/cputemp.sh"
[1] 20:47:30 [SUCCESS] kemeny
Current CPU temperature is 61.22 degrees Celsius.
[2] 20:47:30 [SUCCESS] teller
Current CPU temperature is 82.21 degrees Celsius.
[3] 20:47:30 [SUCCESS] wigner
Current CPU temperature is 82.74 degrees Celsius.
[4] 20:47:31 [SUCCESS] szilard
Current CPU temperature is 82.21 degrees Celsius.
[5] 20:47:31 [SUCCESS] neumann
Current CPU temperature is 82.74 degrees Celsius.
[6] 20:47:31 [SUCCESS] vonkarman
Current CPU temperature is 83.28 degrees Celsius.
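
The cputemp.sh helper isn’t shown in the post. A plausible sketch of such a script, assuming it reads the SoC thermal zone exposed by the kernel (the path and formatting are guesses based on the output above):

#!/bin/sh
# Hypothetical reconstruction of /opt/tools/cputemp.sh
# The kernel exposes the SoC temperature in millidegrees Celsius:
raw=$(cat /sys/class/thermal/thermal_zone0/temp)
temp=$(echo "scale=2; $raw / 1000" | bc)
echo "Current CPU temperature is ${temp} degrees Celsius."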

root@turingpi:/home/lsfadmin# parallel-ssh -h /opt/workers -i "/usr/bin/vcgencmd measure_clock arm" 
[1] 20:47:42 [SUCCESS] kemeny
frequency(48)=1199998000
[2] 20:47:43 [SUCCESS] szilard
frequency(48)=1034000000
[3] 20:47:43 [SUCCESS] teller
frequency(48)=980000000
[4] 20:47:44 [SUCCESS] wigner
frequency(48)=926000000
[5] 20:47:44 [SUCCESS] neumann
frequency(48)=818000000
[6] 20:47:44 [SUCCESS] vonkarman
frequency(48)=872000000

And of course, with high temperatures comes CPU throttling. Clearly, with this thermal situation the run of HPL was not going to be optimal.
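
The clock readings above (well below the 1.2 GHz nominal for these modules) already make the throttling evident. The firmware also keeps explicit throttle flags that can be queried directly, a check not shown in the original post:

# Query the firmware throttle flags on a worker node
/usr/bin/vcgencmd get_throttled
# Bit 2 (0x4) means the SoC is currently throttled;
# bit 18 (0x40000) means throttling has occurred since boot.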

Giant Tiger to the rescue

Even for those in Canada, this may seem like a very strange reference. Giant Tiger is a discount store chain which sells everything from A through Z. Unfortunately, the local “GT Boutique”, as I call it, closed down this past January. I happened to purchase a USB-powered desktop fan on a whim at the GT Boutique about a year ago. The idea was to help keep me cool at my keyboard during the hot summer days. But in this case, it was just what was needed to provide a bit of active cooling to the Turing Pi system.

Repeating the run of HPL with the “highly advanced active cooling” measures in place, we were able to up the HPL results a tad while helping to preserve the life of the cluster nodes. And the results show going from 8.65 GFlops with passive cooling to 9.5 GFlops with the active cooling.

lsfadmin@turingpi:/opt/HPC/hpl-2.3/bin/cm3_3 $ bsub -n 20 -Is -R "select[model==CM3plus] affinity[core(1)]" mpirun --mca btl_tcp_if_exclude lo,docker0 ./xhpl
Job <41866> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on teller>>
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   15968 
NB     :      48       96      192 
PMAP   : Row-major process mapping
P      :       4        5 
Q      :       5        4 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       15968    48     4     5             319.43             8.4985e+00
HPL_pdgesv() start time Sun Mar  3 21:15:42 2024

HPL_pdgesv() end time   Sun Mar  3 21:21:01 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.74851526e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       15968    96     4     5             296.94             9.1423e+00
HPL_pdgesv() start time Sun Mar  3 21:21:05 2024

HPL_pdgesv() end time   Sun Mar  3 21:26:02 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.82600703e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       15968   192     4     5             289.03             9.3926e+00
HPL_pdgesv() start time Sun Mar  3 21:26:06 2024

HPL_pdgesv() end time   Sun Mar  3 21:30:55 2024
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.56990081e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       15968    48     5     4             316.20             8.5855e+00
HPL_pdgesv() start time Sun Mar  3 21:30:59 2024

HPL_pdgesv() end time   Sun Mar  3 21:36:15 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.89956630e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       15968    96     5     4             285.87             9.4961e+00
HPL_pdgesv() start time Sun Mar  3 21:36:19 2024

HPL_pdgesv() end time   Sun Mar  3 21:41:05 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.04113830e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       15968   192     5     4             284.69             9.5355e+00
HPL_pdgesv() start time Sun Mar  3 21:41:09 2024

HPL_pdgesv() end time   Sun Mar  3 21:45:53 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.30812017e-03 ...... PASSED
================================================================================

Finished      6 tests with the following results:
              6 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

And during the runtime, we see that no throttling occurred and the CPU temperatures hovered between the high 50s and low 60s degrees Celsius.

root@turingpi:/home/lsfadmin# parallel-ssh -h /opt/workers -i "/opt/tools/cputemp.sh"
[1] 21:41:25 [SUCCESS] kemeny
Current CPU temperature is 36.48 degrees Celsius.
[2] 21:41:25 [SUCCESS] teller
Current CPU temperature is 58.53 degrees Celsius.
[3] 21:41:25 [SUCCESS] vonkarman
Current CPU temperature is 58.00 degrees Celsius.
[4] 21:41:25 [SUCCESS] neumann
Current CPU temperature is 55.84 degrees Celsius.
[5] 21:41:25 [SUCCESS] szilard
Current CPU temperature is 61.76 degrees Celsius.
[6] 21:41:25 [SUCCESS] wigner
Current CPU temperature is 55.31 degrees Celsius.

root@turingpi:/home/lsfadmin# parallel-ssh -h /opt/workers -i "/usr/bin/vcgencmd measure_clock arm" 
[1] 21:41:29 [SUCCESS] kemeny
frequency(48)=1200000000
[2] 21:41:29 [SUCCESS] teller
frequency(48)=1200000000
[3] 21:41:29 [SUCCESS] vonkarman
frequency(48)=1200000000
[4] 21:41:29 [SUCCESS] wigner
frequency(48)=1200000000
[5] 21:41:29 [SUCCESS] neumann
frequency(48)=1200000000
[6] 21:41:29 [SUCCESS] szilard
frequency(48)=1200002000

Wrap up

I always liked the idea of a small cluster that you could easily take with you. That’s why I’m strongly considering the Turing Pi V2.5, which can work with the much more powerful CM4 modules, among other very capable modules. Budget allowing, I hope to purchase a Turing Pi V2.5 sometime in 2024. As always, stay tuned for more exciting high performance computing tales. And at the end of the day, a compute cluster in a mini ITX format isn’t a pie in the sky idea. For me, it’s a great tool for learning!