Peano
GPU support

ExaHyPE uses GPUs in a classic offloading mode: it takes particular tasks, ships them to the GPU and then ships the results back. This happens under the hood; data neither resides on the accelerator permanently nor is the accelerator used all the time. Not all solvers support GPU offloading, but those that can work with (enclave) tasks either have accelerator support built in or can be tweaked relatively easily to gain something from GPU devices.

If you have a solver with GPU support and want to activate this tasking, you have to do the following steps:

  1. Configure Peano with GPU plus multithreading support.
  2. Follow the instructions around tasking.
  3. Pick a solver with GPU support.
  4. Tweak the task orchestration as discussed below.
  5. Run some performance analysis.
  6. Study the optimisation recommendations.

Using ExaHyPE with GPU support

All instructions on how to enable GPU support are collected in a GPU tutorial. The tutorial only demonstrates how to enable GPU support (which is relatively easy). For proper GPU engineering, you have to study the steps below, which discuss how the GPU offloading works and how you can introduce it systematically such that you gain performance (sometimes, just throwing stuff onto the accelerator without any further thought slows the code down).

How the GPU offloading works

ExaHyPE's GPU offloading approach assumes that there are computations (cell/patch updates) which can be done without any effect on global variables. It works if and only if we can ship a task to the GPU, get the solution data for this task back, and no other global variable changes. We furthermore assume that the computations that fit onto the GPU have no state: they can use some global, static variables, but they cannot access the solver's state, which can change over time. We rely on code parts which have no side effects and do not depend on the solver state (minus global variables).

Working without side-effects might not work for all patches (Finite Volumes and Finite Differences) or octants (RKDG and ADER-DG) in your mesh: There are pieces of the mesh which evaluate and analyse some global data, or build up global data structures. In this case, ExaHyPE only offloads the remaining, simple patches/octants to the GPU. That is, having a solver that supports patches/cells without side effects does not mean that all cells have to be side effect-free.

The second thing to keep in mind is that ExaHyPE tries not to offload a single patch to the GPU. It can, but this is then a degenerate variant of something way more generic: the code takes a set of octants from the mesh and moves them en bloc to the accelerator. There, one compute kernel updates all of them in one rush. We refer to this as fused updates. A description of the fusion approach is summarised in the description of the technical architecture.

The description above gives us a clear roadmap for how to offload ExaHyPE code to the accelerator:

GPU support (programmer's view)

We offload tasks to the accelerator. The particular tasking concept that we implement in ExaHyPE is called enclave tasking. You find the vanilla paper of this idea in

@article{doi:10.1137/19M1276194,
author = {Charrier, Dominic Etienne and Hazelwood, Benjamin and Weinzierl, Tobias},
title = {Enclave Tasking for DG Methods on Dynamically Adaptive Meshes},
journal = {SIAM Journal on Scientific Computing},
volume = {42},
number = {3},
pages = {C69-C96},
year = {2020},
doi = {10.1137/19M1276194}
}

Alternatively, each lowering into Peano yields a README-xxxxx.md file. In this markup file, you also find information on all papers that have an influence on your chosen solver configuration. Solvers that support enclave tasking typically have a distinct name. exahype2.solvers.fv.rusanov.GlobalAdaptiveTimeStepWithEnclaveTasking, for example, is a solver which yields tasks.

Once you have chosen such a solver and compile the code, there should be a tasks subdirectory after you've lowered your code to C++. It makes sense to look into this C++ code from time to time to check if the generated code does make sense.

Assess suitability

Before we look into generated code or even start to implement something for GPUs, we should assess if there are any tasks that could, in theory, go to the accelerator.

Many developers jump straight into accelerator programming and are then frustrated if ExaHyPE does not use the GPU, and blame the code. Most of the time, they haven't done their homework: Peano always tries to maximise CPU usage first before it offloads to the GPU. For tiny problems or unfortunate load balancing, the code hence might not use the GPU even though it is, in principle, GPU-ready. This is a natural consequence of our aim to minimise data movements: if we don't have to ship data to the GPU, we don't do it!

Peano provides an ecosystem to assess to which degree a code might be suitable for an accelerator even before we start the porting:

  1. Compile in the statistics mode. This requires you to switch to
    project.set_Peano4_installation(..., peano4.output.CompileMode.Stats)
  2. Run the code. I recommend that you use a fairly short run, i.e. only a few time steps. The code will provide a statistics file:
    tarch::logging::Statistics::writeToCSV(string) write statistics to file statistics-rank-0.csv (total no of entries=117)
  3. Peano's postprocessing comes along with several scripts to inspect and to postprocess the outcome. Inspect the statistics file
    python3 ~/git/Peano/python/peano4/postprocessing/inspect-statistics.py statistics-rank-0.csv
    Column | Counter
    --------|----------------
    0 | time (always used to plot)
    1 | exahype2::EnclaveBookkeeping::memory-allocations
    2 | mpi wait times
    3 | tarch::multicore::bsp-concurrency-level
    4 | tarch::multicore::spawned-tasks
    5 | tarch::multicore::taskfusion::process-fused-tasks
    6 | tarch::multicore::taskfusion::submit-fused-tasks
    7 | grid-control-events
    to find out which column gives you information about spawned tasks and all the fused tasks.
  4. Visualise the number of tasks over time:
    python3 ~/git/Peano/python/peano4/postprocessing/plot-statistics.py -c 4,5,6 -o pdf statistics-rank-0.csv
    python3 ~/git/Peano/python/peano4/postprocessing/plot-statistics.py -c 5 -pt step -o pdf statistics-rank-0.csv

If the resulting plots show that the code does spawn a lot of tasks which could, in theory, be fused, but the fusion only ever takes one task at a time and deploys it, we face issues. Consult the @tarch_logging "statistics page" for some more information on the plots.

In this example, we clearly don't produce tasks quickly enough. There are always some idling cores which snap away fusable tasks before they can accumulate on the host. Furthermore, the scheduler changes its behaviour after some time and no tasks are produced at all anymore. These effects have to be studied and sorted out first, before we continue to try to offload to the GPU.

Enable stateless compute kernels

You next have to instruct your solver that you plan to have patch or cell updates without (write) access to the solver's internal state and without altering the global state. Obviously, you can weaken these assumptions later on, but this then requires further manual work.

The Finite Volume enclave solvers for example all support stateless kernels, but the versions without enclave tasking do not support them. The solver documentation should clarify which one to select. For most of these GPU-enabled solvers, it is sufficient to pass an additional flag

pde_terms_without_state=True

into the constructor that tells the solver that we have PDE terms which have no side effects and do not need the solver object. Consult the documentation of the respective solvers. Again, exahype2.solvers.fv.rusanov.GlobalAdaptiveTimeStepWithEnclaveTasking is a good prototype to study:

my_solver = exahype2.solvers.fv.rusanov.GlobalAdaptiveTimeStepWithEnclaveTasking(
    name="GPUFVSolver",
    patch_size=18,
    min_volume_h=0.0001,
    max_volume_h=0.01,
    pde_terms_without_state=True,  # This part is the relevant one here
)

Provide the additional PDE terms without state

To facilitate the offloading, we have to create alternative versions of our PDE term functions that work independently of the solver object (and in return cannot modify it). Depending on which terms we have, we'll need stateless versions of the flux, the non-conservative product and the source term. We'll always need the eigenvalue function. Per function, keep the original one and add a second one which is

  • static;
  • accepts an additional flag of type Offloadable. This is the last argument, i.e. a static version of a function has exactly the same arguments as the non-static, default variant plus one more. The last argument is solely there to distinguish the static version from the normal one, as C++ cannot overload w.r.t. static vs. non-static. Offloadable is automatically defined in the abstract base class which ExaHyPE's Python API generates.
  • is embedded into
    #if defined(OpenMPGPUOffloading)
    #pragma omp declare target
    #endif
    my new static function variants
    #if defined(OpenMPGPUOffloading)
    #pragma omp end declare target
    #endif
    You can omit the declare annotations if you use SYCL only.

Very often, the standard flux and eigenvalue routines can invoke the static variants and you can thus eliminate code redundancies. For example, you might have the normal flux function and a static flux function with the additional Offloadable parameter, but the normal function just invokes the static cousin.

In many of our codes, we take a function like

void examples::exahype2::SSInfall::SSInfall::flux( ... ) {
  logTraceInWith4Arguments( "flux(...)", faceCentre, volumeH, t, normal );
  // compute something
  logTraceOutWith4Arguments( "flux(...)", faceCentre, volumeH, t, normal );
}

and simply split it into two variants:

void examples::exahype2::SSInfall::SSInfall::flux( ... ) {
  logTraceInWith4Arguments( "flux(...)", faceCentre, volumeH, t, normal );
  flux(Q,faceCentre,volumeH,t,normal,F,Offloadable::Yes);
  logTraceOutWith4Arguments( "flux(...)", faceCentre, volumeH, t, normal );
}

void examples::exahype2::SSInfall::SSInfall::flux( ..., Offloadable ) {
  // compute something
}

The static version is used on the GPU. The normal one is used on the CPU yet defers immediately to the static version. Please note that GPUs are quite restrictive w.r.t. terminal outputs and assertions. You will have to remove both from your static routine versions. In the example above, the non-static version wraps the core functionality into assertions. Therefore, we have the assertions on the CPU, but the core part that goes to the GPU is free of assertions. The same holds for the logging.

Most ExaHyPE solvers allow you to insert the realisation of a routine directly from the Python code. In this case, both variants end up in the AbstractMySolver class: the default one as well as the stateless one.

Compile, run and check

In principle, the code is now GPU ready. Compile and run.

Tailor kernel choice

Most ExaHyPE solvers employ at least three different kernel variations: a normal one, one that vectorises aggressively on the host, and one that offloads to the GPU. If you use the default configuration, all should be set to some reasonable defaults. If this is not the case, you can manually alter

    self._fused_compute_kernel_call_cpu
    self._fused_compute_kernel_call_gpu

These values hold C++ strings, which are typically method invocations to the corresponding computing kernels. If you play around with these values, you might want to study further GPU realisation details and dive into the code.

Fuse tasks

Finally, we might want to tailor the fusion of tasks. For example, you might want to offload to the GPU if and only if there are enough tasks that you can bundle (fuse) into one meta task. Or you might want to use different GPUs depending on the task context. These decisions are made by the multithreading orchestration (among other things).

You can switch the orchestration manually in your C++ file. By default, the initialisation of ExaHyPE does this:

tarch::multicore::setOrchestration( tarch::multicore::orchestration::createDefaultStrategy() );

You can alter this one. A more elegant variant is to use exahype2.Project.set_multicore_orchestration() to let the lowering into Peano automatically insert an appropriate setOrchestration() call. The namespace tarch::multicore::orchestration holds a bundle of different orchestration strategies that you can use to tweak your code. A good starting point is

my_project.set_multicore_orchestration( "new otherOrchestration(...)" )

Advanced techniques

Select which octants (cells or patches) to ship to the GPU and back

If only some patches/cells can be offloaded to the GPU, then you can redefine the routine

virtual bool patchCanUseStatelessPDETerms(
  const tarch::la::Vector<DIMENSIONS,double>& x,
  const tarch::la::Vector<DIMENSIONS,double>& h,
  double t,
  double dt
) const;

in your solver. By default, this routine returns true. This default is written in the AbstractXXX solver. But nothing stops you from redefining the function in your particular solver subclass.

Write your own orchestration

Very advanced codes write their own orchestration which ships particular tasks specifically to the GPU. Through this, you can tailor the task execution pattern towards your specific needs. Notably, the orchestration is asked per task type whether and where to ship the task.

Further reading

Adding GPU support to your ExaHyPE source code (programmer view)

This page discusses how ExaHyPE supports SIMT and SIMD compute kernels. ExaHyPE uses GPUs in a classic offloading mode, i.e. GPUs as accelerators. This means that it takes particular tasks, ships them to the GPU and then ships the results back. This happens under the hood; data neither resides on the accelerator permanently nor is the accelerator used all the time.

The whole approach assumes that there are computations (cell/patch updates) which can be done without any effect on global variables. It works if and only if we can ship a task to the GPU, get the solution data for this task back, and no other global variable changes. We furthermore assume that the computations that fit onto the GPU have no state: they can use some global, static variables, but they cannot access the solver's state, which can change over time. We rely on code parts which have no side effects and do not depend on the solver state (minus global variables).

The same argument holds for aggressive vectorisation: Vectorisation works best if the computations on a patch do not have any side effects. Basically, ExaHyPE assumes that GPU's SIMT and CPU's SIMD are two realisations of the same pattern.

Working without side-effects might not work for all patches: There are always patches which evaluate and analyse some global data, or build up global data structures. In this case, ExaHyPE only offloads the remaining, simple patches to the GPU. That is, having a solver that supports patches/cells without side effects does not mean that all cells have to be side effect-free.

Before you dive into this technical description of how GPU offloading (and very aggressive vectorisation) work, it makes sense to read through the tutorial on GPU offloading.

Compile with GPU support

Configure ExaHyPE with the argument --with-gpu=xxx and select an appropriate GPU programming model. Rebuild the whole Peano core. This includes the ExaHyPE libraries.

Enable stateless compute kernels

The Finite Volume enclave solvers for example all support GPUs, but the versions without enclave tasking do not support them. The solver documentation should clarify which one to select. For most of these GPU-enabled solvers, it is sufficient to pass an additional flag

pde_terms_without_state=True

(for the FV variants, for example) that tells the solver that we have PDE terms which have no side effects and do not need the solver object. Consult the documentation of the respective solvers. Examples are

  • exahype2.solvers.fv.rusanov.GlobalAdaptiveTimeStepWithEnclaveTasking
  • yet to be added

Provide the additional PDE terms without state

To facilitate the offloading, we have to create alternative versions of our PDE term functions that work independently of the solver object (and in return cannot modify it). Depending on which terms we have, we'll need stateless versions of the flux, the non-conservative product and the source term. We'll always need the eigenvalue function. Per function, keep the original one and add a second one which is

  • static;
  • accepts an additional flag of type Offloadable. This is the last argument, i.e. a static version of a function has exactly the same arguments as the non-static, default variant plus one more. The last argument is solely there to distinguish the static version from the normal one, as C++ cannot overload w.r.t. static vs. non-static. Offloadable is automatically defined in the abstract base class which ExaHyPE's Python API generates.

Most ExaHyPE solvers allow you to insert the realisation of a routine directly from the Python code. In this case, the call ends up in the AbstractMySolver class. Most classes call the corresponding routine set_implementation(). If you use this one, the code generator usually does not distinguish the two callback flavours and uses the same code snippet for the normal as well as the vectorised version. However, it is likely that the routines end up in the cpp file of the abstract superclass, i.e. the compiler will not be able to inline anything. If your compiler struggles with the inlining, you might be forced either

  • to dive into linker-time optimisation - Intel, for example, did allow you to run interprocedural optimisation (ipo) on bundles of object files; or
  • to switch to a manual implementation and to move the GPU routines into the headers.

Alter compute kernel

ExaHyPE solvers which support GPUs/stateless operators typically host three different compute kernels. These kernels are plain C++ function calls, i.e. strings, on the Python level.

  • self._compute_kernel_call is a C++ function call for the normal task. This guy is called whenever you hit an octant, i.e. a Finite Volume patch or a DG cell. It calls the virtual flux, eigenvalue, ... functions on the solver object.
  • self._fused_compute_kernel_call_cpu is a C++ function call that ExaHyPE uses whenever it encounters a set of octants in the tree which can be processed embarrassingly parallel as they all invoke the same PDE and none of them alters the state of the underlying solver object. So the compute kernels identified by this string can call the static flux and eigenvalue instead of the virtual function.
  • self._fused_compute_kernel_call_gpu is the equivalent to _fused_compute_kernel_call_cpu which the code uses when it decides to offload a bunch of cells to an accelerator.

By default, each of these strings points to a generic kernel call which automatically picks a working default backend depending on your configuration.

While this gives you a working version quickly, you might want to switch to a tailored variant for your particular solver. This can be done by changing the actual solver kernel:

mysolver._fused_compute_kernel_call_gpu = exahype2.kerneldsl.api.compile( .... )

Mask out cells which are not a fit

If only some patches/cells can be offloaded to the GPU, then you can redefine the routine

virtual bool patchCanUseStatelessPDETerms(
  const tarch::la::Vector<DIMENSIONS,double>& patchCentre,
  const tarch::la::Vector<DIMENSIONS,double>& patchH,
  double t,
  double dt
) const;

in your solver. By default, this routine always returns true. Here's the clue: this is a normal member function, i.e. you can use the solver's state and make the result depend on it.

patchCanUseStatelessPDETerms() yields, by default, true, so all compute kernels end up on the GPU if they are embedded into enclave tasks. You might want to alter this and keep some kernels on the host.

Sanity checks

The sections below discuss what you should see in the code.

Introduce aggressively vectorised kernel

Once you have a GPU-ready solver, it makes sense to double-check quickly if the solver really dispatches the correct compute kernels. For this, we first study the generated task. Stateless routines are also very useful on the CPU, as we can aggressively vectorise over them. Therefore, the task should look similar to

double benchmarks::exahype2::ccz4::tasks::CCZ4FVEnclaveTask::applyKernelToCell(...) {
  ...
  if (repositories::instanceOfCCZ4FV.patchCanUseStatelessPDETerms(
    marker.getX(),
    marker.getH(),
    t,
    dt
  )) {
    ::exahype2::fv::rusanov::timeStepWithRusanovPatchwiseHeapStateless<
      ...
    >(patchData);
  } else {
    ::exahype2::fv::rusanov::timeStepWithRusanovPatchwiseHeapFunctors<
      ...
    >(patchData,
      ...
    );
  }
}

The kernels might look completely different, but the principle is clear: whenever we hit a cell, we check if patchCanUseStatelessPDETerms() holds. If it does not, we use the default version. If it holds, we invoke the specialised version which assumes that there is a variant with static (stateless) calls.

Fused compute kernels

Scroll further down until you see the fuse() routine. This one is again very simple to digest:

bool benchmarks::exahype2::ccz4::tasks::CCZ4FVEnclaveTask::fuse(const std::list<Task*>& otherTasks, int targetDevice) {
  ...
  #if defined(WITH_GPU_SYCL)
  if (targetDevice >= 0 or targetDevice == Host) {
    foundOffloadingBranch = true;
    ::exahype2::fv::rusanov::sycl::timeStepWithRusanovPatchwiseHeapStateless<
      ...
    >(targetDevice, patchData);
  }
  #endif

We see here that SYCL can run on both the host and the device, whereas the OpenMP version is picked if and only if we have a valid device.

If this call logic makes no sense, most solvers have attributes such as

    self._fused_compute_kernel_call_cpu
    self._fused_compute_kernel_call_gpu

to switch to the modified kernel realisation variant. Typically, there are factory mechanisms to alter those guys:

my_solver._fused_compute_kernel_call_gpu = exahype2.solvers.fv.rusanov.kernels.create_compute_Riemann_kernel_for_Rusanov(
    my_solver._flux_implementation,
    my_solver._ncp_implementation,
    my_solver._source_term_implementation,
    compute_max_eigenvalue_of_next_time_step=True,
    solver_variant=exahype2.solvers.fv.rusanov.kernels.SolverVariant.Accelerator,
    kernel_variant=exahype2.solvers.fv.rusanov.kernels.KernelVariant.GenericAoS,
)