Peano
Reference Configurations for Some Machines

Hamilton

Hamilton 8 is Durham's internal supercomputer. Although the system is powered by AMD EPYC processors, I prefer the Intel toolchain on this machine.

Hamilton's default oneAPI module is built against a GCC version which is too old and lacks important features in its standard library. Therefore, you have to load the latest GCC manually. The module system normally prevents this, since two GCC versions loaded at the same time count as a conflict, so we have to disable the conflict checks a priori. Eventually, I end up with

module purge
module load oneapi/2024.2
export FLAVOUR_NOCONFLICT=1
module load gcc/14.2
module load intelmpi
module load tbb/2022.0
export I_MPI_CXX=icpx

For vectorisation, we always need

-Ofast -ffast-math

If you use mpiicpc, the MPI wrapper still refers to icpc, even though icpc is officially deprecated. Therefore, I have to repoint it to icpx manually via I_MPI_CXX (see above). The other options to achieve this (e.g. additional wrapper arguments) all failed on this machine.
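
With I_MPI_CXX=icpx exported as above, you can inspect the wrapper's underlying command line to verify that the repointing worked:

mpiicpc -show    # should now print a compile line based on icpx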

  • TBB runs
    ./configure CXX=icpx CC=icx 'CXXFLAGS=-O3 -ffast-math -mtune=native -march=native -fma -fomit-frame-pointer -std=c++20 -fno-exceptions -Wno-unknown-attributes -Wno-gcc-compat' LIBS="-ltbb" LDFLAGS="-L${TBBROOT}/lib" --enable-mghype --enable-exahype --enable-blockstructured --enable-finiteelements --with-multithreading=tbb_extension --enable-loadbalancing

Older settings (not tested at the moment):

  • MPI production runs
    ./configure CC=icpx CXX=icpx CXXFLAGS="-Ofast -g -std=c++20 -mtune=native -march=native -fma -fomit-frame-pointer -qopenmp -Wno-unknown-attributes" LDFLAGS="-fiopenmp -g" --with-multithreading=omp --enable-exahype --enable-loadbalancing --enable-blockstructured --enable-particles --with-mpi=mpiicpc FC=gfortran
  • Performance analysis runs with single node tracing
    module load vtune
    ./configure CC=icpx CXX=icpx CXXFLAGS="-Ofast -g -std=c++20 -mtune=native -march=native -fma -fomit-frame-pointer -qopenmp -Wno-unknown-attributes -I${VTUNE_HOME}/vtune/latest/include" LDFLAGS="-qopenmp -g -L${VTUNE_HOME}/vtune/latest/lib64" LIBS="-littnotify" --with-multithreading=omp --enable-exahype --enable-loadbalancing --enable-blockstructured --enable-particles --with-mpi=mpiicpc FC=gfortran --with-toolchain=itt
  • MPI performance analysis runs
    module load vtune
    ./configure CC=icpx CXX=icpx CXXFLAGS="-I${VTUNE_HOME}/vtune/latest/include -I${VT_ROOT}/include -Ofast -g -std=c++20 -mtune=native -march=native -fma -fomit-frame-pointer -qopenmp -Wno-unknown-attributes" LDFLAGS="-qopenmp -g -L${VT_LIB_DIR} -L${VTUNE_HOME}/vtune/latest/lib64" LIBS="-lVT ${VT_ADD_LIBS} -littnotify" --with-multithreading=omp --enable-exahype --enable-loadbalancing --enable-blockstructured --enable-particles --with-mpi=mpiicpc FC=gfortran --with-toolchain=itac

Beyond these flags, I pause and resume the data collection manually in ExaHyPE and Swift. This happens in the main() routine of the respective applications.
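
With the itt toolchain configuration from above, this boils down to two calls from ittnotify.h. Below is a minimal sketch; setUpSimulation() and runTimeSteps() are hypothetical placeholders, not Peano API:

#include <ittnotify.h>   // ships with VTune; linked via -littnotify

void setUpSimulation() { /* placeholder: grid construction, load balancing, ... */ }
void runTimeSteps()    { /* placeholder: the compute phase we actually want to profile */ }

int main() {
  __itt_pause();     // suspend data collection while we set things up
  setUpSimulation();
  __itt_resume();    // collect only over the measured phase
  runTimeSteps();
  __itt_pause();     // stop collecting before teardown and output
  return 0;
}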

NCC

At the moment, I recommend using gpu10, gpu11 or gpu12, i.e. the A100 machines. By default, I work with NVC++ on this cluster, but the oneAPI software stack should work, too.

module load nsight-systems cuda llvm
module load mvapich2/2.3.5-2
./configure --enable-exahype --enable-loadbalancing-toolbox --with-multithreading=omp --with-mpi=mpicxx --with-gpu=omp CXX=clang++ CXXFLAGS="-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda"

OpenMP with NVC++

I use the setup

./configure CXX=nvc++ CC=nvc++ CXXFLAGS="-O4 --std=c++17 -mp=gpu -gpu=cc80" LDFLAGS="-mp=gpu -gpu=cc80" --with-multithreading=omp --with-gpu=omp --enable-exahype --enable-blockstructured --enable-loadbalancing CPP=cpp FC=/apps/nvidia-hpc-sdk/Linux_x86_64/22.3/compilers/bin/nvfortran

on the A100 nodes.

We had some issues (compiler crashes) with NVC++ 2022.5; those are filed with NVIDIA. For the time being, please revert to another compiler generation. Version 2022.3 works, for example.

Some nodes (e.g. gpu3) have Titan XP cards, which do not support OpenMP offloading; only GPUs with compute capability >= 7.0 can offload with OpenMP. You can check the GPU model with nvidia-smi -L.
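
On the A100 nodes, for example, the output should look something along the lines of

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-...)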

Note that there is a bug in the NVIDIA software stack at the moment: many runs crash with a complaint about a lack of visible devices. Hence, ensure that you set the following environment variable before you start your experiments:

export CUDA_VISIBLE_DEVICES=0

SYCL offloading with oneAPI

I used this to configure Peano with Intel's TBB enabled

source /opt/intel/oneapi/setvars.sh
module load cuda
./configure CC=icx CXX=icpx LIBS="-ltbb" LDFLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64 -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80" CXXFLAGS="-O3 -std=c++20 -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64 -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80" --with-multithreading=tbb --enable-exahype --enable-blockstructured --enable-loadbalancing --with-gpu=sycl

Codeplay's website has some useful information on compiler flags. Here's a short explanation of the SYCL-related flags above:

  • -fsycl tells the compiler to link the SYCL libraries and makes it aware of SYCL code. Clang seems to behave slightly differently from icpx here and needs an explicit -lsycl in the linker flags.
  • -fsycl-targets= lets the compiler generate code for different targets, e.g. for devices from different vendors or for the host.
  • -Xsycl-target-backend= is required if you want to select a specific device architecture. For example, I put --cuda-gpu-arch=sm_80 after it to compile for A100 with compute capability 8.0.

This setup also needs oneAPI version 2023.0.0 or newer, together with Codeplay's NVIDIA GPU plugin, which is not installed on NCC. Intel encourages the use of their own compiler, icpx. Note, however, that icpx uses a lower floating-point precision by default in favour of performance; use -fp-model (e.g. -fp-model=precise) to change this behaviour.

I had linker errors about undefined references when building Peano with TBB. Just add the path to the relevant library if you experience this.
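
A sketch of the kind of fix, reusing the TBBROOT variable that oneAPI's setvars.sh exports:

export LDFLAGS="${LDFLAGS} -L${TBBROOT}/lib"    # make the TBB runtime visible to the linker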

When running the executable on the GPU, prefix the command with SYCL_DEVICE_FILTER=gpu to tell SYCL explicitly to run the code on the GPU. In the case of a runtime error, it is also useful to add SYCL_PI_TRACE=2 to check where the error was raised.
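
For example (the executable name is a placeholder):

SYCL_DEVICE_FILTER=gpu SYCL_PI_TRACE=2 ./peano-benchmark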

DINE

DINE (Durham Intelligent NIC Environment) is a small supercomputer attached to Cosma, which we use to prototype some of our algorithms. Access is through the Cosma login nodes (login8.cosma.dur.ac.uk). As Cosma and DINE feature the same generation of AMD chips, you can compile on Cosma8 and ship the executable to DINE.

module load git
module load autoconf
module load python
module load intel_comp
module load compiler-rt tbb compiler
module load intel_mpi
module load gnu_comp/13.1.0
export I_MPI_CXX=icpx
./configure --enable-exahype --enable-blockstructured --enable-loadbalancing --enable-particles CC=icx CXX=icpx --with-multithreading=omp --with-mpi=mpiicpc CXXFLAGS="-std=c++20 -qopenmp -funroll-loops -Ofast"

There are a few things to consider whenever you use DINE:

  • The default Python version is usually a little too old, so switching to a more recent Python module is reasonable.
  • Don't use the -xHost flag in configure, as the resulting code will crash on DINE. Use
    -march=native
    instead.
  • Try not to dump data into the home directory, as it is backed up and has a quota. DINE mounts the Cosma 5 data directory (/cosma5/data/do009/myaccount), which you should use to store data files.

MAD07 (Ice Lake)

mad07 is an Ice Lake server with Barlow Pass memory, added to Cosma to facilitate benchmarking for the SKA project.

module purge
module unload python
module load python/3.10.12
module load intel_comp/2023.2.0 compiler mpi
module load gnu_comp/13.1.0
./configure --enable-exahype --enable-blockstructured --enable-loadbalancing --enable-particles CC=icc CXX=icpx --with-multithreading=omp CXXFLAGS="-std=c++20 -qopenmp -funroll-loops -Ofast -xHost" LIBS=-ltbb

Previous installations provided a dedicated oneAPI module. As this seems to be deprecated by Intel, one now has to load intel_comp and then the corresponding compiler and mpi modules.

The chip is from the Ice Lake-SP family, i.e. the generation released in summer 2021. Without any further settings, OpenMP delivers a concurrency of around 112, since the two processors each feature 28 cores with hyperthreading. I usually don't use hyperthreading, i.e. I set

export OMP_NUM_THREADS=56
export OMP_PROC_BIND=true

I originally thought the node had 2x40 cores, but my benchmarks report an OpenMP concurrency of 112, i.e. 112 hardware threads in total, which suggests 2x28 cores with hyperthreading enabled.
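
lscpu settles such questions, as it reports sockets, cores per socket and threads per core explicitly:

lscpu | grep -E 'Socket|Core|Thread'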

MAD08/MAD09 (Genoa)

These two servers feature AMD's Genoa chips.

module purge
module unload python
module load python/3.10.12
module load intel_comp/2023.2.0 compiler mpi
module load gnu_comp/13.1.0
./configure --enable-exahype --enable-blockstructured --enable-loadbalancing --enable-particles CC=icc CXX=icpx --with-multithreading=omp CXXFLAGS="-std=c++20 -qopenmp -funroll-loops -Ofast -xHost" LIBS=-ltbb

There are a few pitfalls with the AMD servers if you use the Intel toolchain. The Intel compiler does not recognise that the chip supports modern instruction sets; notably, it is not aware of its AVX512 capabilities. Therefore, it produces machine code which does not use these features. You have to force it manually to produce the instructions you want, by adding the compile flags

-O3 -fomit-frame-pointer -fstrict-aliasing -ffast-math -funroll-loops -axCOMMON-AVX512 -march=x86-64-v4 -mavx512vbmi

Swift also requires you to pass

-qopt-zmm-usage=high

to vectorise some particularly compute-heavy tasks (gravity).

Grace Hopper

The Grace Hopper build configuration is similar to any other build:

./configure CXX=nvc++ CC=nvc++ CXXFLAGS="-O4 --std=c++17 -mp=gpu -gpu=cc90,cuda12.3" LDFLAGS="-mp=gpu -gpu=cc90,cuda12.3" --with-multithreading=omp --with-gpu=omp --enable-exahype --enable-blockstructured --enable-loadbalancing

It is important to use cc90 here, plus a reasonably new CUDA version. Furthermore, most systems deliver the NVC++ pipeline with a rather old GCC/STL version, so it might be necessary to load a newer GCC. Depending on the SLURM configuration, you might have to request GPUs explicitly to get onto a Grace Hopper node.
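
The exact incantation is site-specific, but an interactive session on a GPU node is typically requested along the lines of

srun --gres=gpu:1 --pty bash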

SuperMUC-NG Phase 2

On SuperMUC-NG Phase 2, we can basically rely on the standard Intel toolchain:

module load intel
module load gcc
./configure CC=icpx CXX=icpx CXXFLAGS="-Ofast -g -std=c++20 -mtune=native -march=native -fma -fomit-frame-pointer -fiopenmp -Wno-unknown-attributes --gcc-toolchain=/dss/lrzsys/sys/spack/release/24.1.0/opt/x86_64/gcc/13.2.0-gcc-itoa7pi/" LDFLAGS="--gcc-toolchain=/dss/lrzsys/sys/spack/release/24.1.0/opt/x86_64/gcc/13.2.0-gcc-itoa7pi/ -fiopenmp -g" --with-multithreading=omp --enable-exahype --enable-loadbalancing --enable-blockstructured --enable-particles

Some remarks on this setup:

  • The Intel module is at least 2024.
  • The default Intel module is built against an older GCC version, in which some filesystem routines and classes still live in std::experimental. Peano, however, expects these classes in std. Therefore, we swap the STL manually via --gcc-toolchain. This might become obsolete with future compiler modules.
  • The toolchain also has to be known to the linker explicitly, so we have to specify the path in the LDFLAGS, too.
  • Surprisingly, the code, once built, also runs with system-level GCC versions. However, you have to load a recent GCC (13.2.0 here) manually before you launch any Peano application; a sketch follows below.
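
A sketch of such a launch sequence (module version and executable name are placeholders):

module load gcc/13.2.0
./my-peano-application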

AMD cloud machine

This is a guideline for the AMD cloud machines that we use in hackathons. Login works with standard ssh:

ssh username@aac1.amd.com -p myport

where you have to pick an active port (7000 or 7010). We only need one module for the core GPU tests:

module load aomp/amdclang

(on older installations, it is module load amdclang/17.0-6.0.0) and then use the following configuration:

./configure CFLAGS="-D__AMDGPU__ -O3 -march=native -fopenmp -lstdc++fs --offload-arch=gfx90a" CC="amdclang" CXXFLAGS="-std=c++17 -w -O3 -march=native -fopenmp -lstdc++fs --offload-arch=gfx90a" CXX="amdclang++" LDFLAGS="-march=native -fopenmp -lstdc++fs --offload-arch=gfx90a" --with-multithreading=omp --with-gpu=omp --enable-exahype --enable-loadbalancing --enable-blockstructured

A good starting point for the GPU work is benchmarks/exahype2/ccz4/performance-testbed.

Example: Local WSL Ubuntu 22.04

This is meant as a minimal build configuration for working with TBB locally. It is important to use an up-to-date C++ compiler such as icpx (g++-12, for example, will not work with the C++20 features used in Peano's TBB code); a quick version check is shown below.

First, the oneAPI environment has to be set up:

module load intel

or

source [path to oneapi installation]/setvars.sh
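
A quick sanity check that the right compilers are picked up afterwards:

icpx --version
g++ --version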

Configuration:

./configure CC=icx CXX=icpx LIBS="-ltbb" CXXFLAGS="-std=c++20" --with-multithreading=tbb

With exahype2:

./configure CC=icx CXX=icpx LIBS="-ltbb" CXXFLAGS="-std=c++20" --with-multithreading=tbb --enable-exahype --enable-blockstructured --enable-loadbalancing

Leonardo

On Leonardo, we use a job script along the following lines; the final srun line hands each MPI rank its own GPU via CUDA_VISIBLE_DEVICES:

#!/bin/bash
#SBATCH -A
#SBATCH -p boost_usr_prod
#SBATCH --job-name=MyJob
#SBATCH --time=08:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4
export OMP_PLACES=cores
export OMP_PROC_BIND=close
ulimit -c unlimited
ulimit -s unlimited
module purge
module load gcc
module load cmake
module load cuda
module load openmpi/4.1.6--gcc--12.2.0
module load netcdf-c/4.9.2--openmpi--4.1.6--gcc--12.2.0
module load git-lfs/3.1.2
srun --cpu-bind=verbose,cores bash -c 'env CUDA_VISIBLE_DEVICES=$SLURM_PROCID ./Executable'
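
If the script is stored as job.slurm (the name is arbitrary), it is submitted via

sbatch job.slurm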