Peano
GPU support

ExaHyPE uses GPUs in a classic offloading mode: it takes particular tasks, ships them to the GPU and then ships the results back. This happens under the hood; data neither resides on the accelerator permanently nor is the accelerator used all the time. Not all solvers support GPU offloading, but those that can work with (enclave) tasks either have accelerator support built in or can be tweaked relatively easily to gain something from GPU devices.

If you have a solver with GPU support and want to activate this tasking, you have to do the following steps:

  1. Configure Peano with GPU plus multithreading support.
  2. Follow the instructions around tasking.
  3. Pick a solver with GPU support.
  4. Tweak the task orchestration as discussed below.
  5. Run some performance analysis.
  6. Study the optimisation recommendations.

Using ExaHyPE with GPU support

All instructions on how to enable GPU support are collected in a GPU tutorial. The tutorial only demonstrates how to enable GPU support (which is relatively easy). For proper GPU engineering, you have to study the steps below, which discuss how the GPU offloading works and how you can introduce it systematically such that you gain performance (sometimes, just throwing stuff onto the accelerator without any further thought slows the code down).

How the GPU offloading works

ExaHyPE's GPU offloading approach assumes that there are computations (cell/patch updates) which can be done without any effect on global variables. It works if and only if we can ship a task to the GPU, get the solution data for this task back, and no other global variable changes. We furthermore assume that the computations that fit onto the GPU have no state: they can use some global, static variables, but they cannot access the solver's state, which can change over time. We rely on code parts which have no side effects and do not depend on the solver state (minus global variables).

Working without side-effects might not work for all patches (Finite Volumes and Finite Differences) or octants (RKDG and ADER-DG) in your mesh: There are pieces of the mesh which evaluate and analyse some global data, or build up global data structures. In this case, ExaHyPE only offloads the remaining, simple patches/octants to the GPU. That is, having a solver that supports patches/cells without side effects does not mean that all cells have to be side effect-free.

The second thing to keep in mind is that ExaHyPE tries not to offload a single patch to the GPU. It can, but this is then a degenerate variant of something way more generic: the code takes a set of octants from the mesh and moves them en bloc to the accelerator. There, one compute kernel updates all of them in one rush. We refer to this as fused updates. A description of the fusion approach is summarised in the description of the technical architecture.

The description above gives us a clear roadmap for how to offload ExaHyPE code to the accelerator:

GPU support (programmer's view)

We offload tasks to the accelerator. The particular tasking concept that we implement in ExaHyPE is called enclave tasking. You find the vanilla paper of this idea in

@article{doi:10.1137/19M1276194,
author = {Charrier, Dominic Etienne and Hazelwood, Benjamin and Weinzierl, Tobias},
title = {Enclave Tasking for DG Methods on Dynamically Adaptive Meshes},
journal = {SIAM Journal on Scientific Computing},
volume = {42},
number = {3},
pages = {C69-C96},
year = {2020},
doi = {10.1137/19M1276194}
}

Alternatively, each lowering into Peano yields a README-xxxxx.md file. In this markup file, you also find information on all papers that have an influence on your chosen solver configuration. Solvers that support enclave tasking typically have a distinct name. exahype2.solvers.fv.rusanov.GlobalAdaptiveTimeStepWithEnclaveTasking, for example, is a solver which yields tasks.

Once you have chosen such a solver and compile the code, there should be a tasks subdirectory after you've lowered your code to C++. It makes sense to look into this C++ code from time to time to check if the generated code does make sense.

Assess suitability

Before we look into generated code or even start to implement something for GPUs, we should assess if there are any tasks that could, in theory, go to the accelerator.

Many developers jump straight into accelerator programming and are then frustrated if ExaHyPE does not use the GPU, and blame the code. Most of the time, they haven't done their homework: Peano always tries to maximise CPU usage first before it offloads to the GPU. For tiny problems or unfortunate load balancing, the code hence might not use the GPU even though it is, in principle, GPU-ready. This is a natural consequence of our aim to minimise data movements: if we don't have to ship data to the GPU, we don't do it!

Peano provides an ecosystem to assess to which degree a code might be suitable for an accelerator even before we start the porting:

  1. Compile in the statistics mode. This requires you to switch to
    project.set_Peano4_installation(..., peano4.output.CompileMode.Stats)
  2. Run the code. I recommend that you use a fairly short run, i.e. only a few time steps. The code will provide a statistics file:
    tarch::logging::Statistics::writeToCSV(string) write statistics to file statistics-rank-0.csv (total no of entries=117)
  3. Peano's postprocessing comes along with several scripts to inspect and to postprocess the outcome. Inspect the statistics file
    python3 ~/git/Peano/python/peano4/postprocessing/inspect-statistics.py statistics-rank-0.csv
    Column | Counter
    --------|----------------
    0 | time (always used to plot)
    1 | exahype2::EnclaveBookkeeping::memory-allocations
    2 | mpi wait times
    3 | tarch::multicore::bsp-concurrency-level
    4 | tarch::multicore::spawned-tasks
    5 | tarch::multicore::taskfusion::process-fused-tasks
    6 | tarch::multicore::taskfusion::submit-fused-tasks
    7 | grid-control-events
    to find out which column gives you information about spawned tasks and all the fused tasks.
  4. Visualise the number of tasks over time:
    python3 ~/git/Peano/python/peano4/postprocessing/plot-statistics.py -c 4,5,6 -o pdf statistics-rank-0.csv
    python3 ~/git/Peano/python/peano4/postprocessing/plot-statistics.py -c 5 -pt step -o pdf statistics-rank-0.csv

If the resulting plots show that the code does spawn a lot of tasks which could, in theory, be fused, but the fusion only ever takes one task at a time and deploys it, we face issues. Consult the @tarch_logging "statistics page" for some more information on the plots.

In this example, we clearly don't produce tasks quickly enough. There are always some idling cores which snap away fusable tasks before they can accumulate on the host. Furthermore, the scheduler changes its behaviour after some time and no tasks are produced at all anymore. These effects have to be studied and sorted out first, before we continue to try to offload to the GPU.

Enable stateless compute kernels

You next have to instruct your solver that you plan to have patch or cell updates without (write) access to the solver's internal state and without altering the global state. Obviously, you can weaken these assumptions later on, but this then requires further manual work.

The Finite Volume enclave solvers for example all support stateless kernels, but the versions without enclave tasking do not support them. The solver documentation should clarify which one to select. For most of these GPU-enabled solvers, it is sufficient to pass an additional flag

pde_terms_without_state=True

into the constructor that tells the solver that we have PDE terms which have no side effects and do not need the solver object. Consult the documentation of the respective solvers. Again, exahype2.solvers.fv.rusanov.GlobalAdaptiveTimeStepWithEnclaveTasking is a good prototype to study:

my_solver = exahype2.solvers.fv.rusanov.GlobalAdaptiveTimeStepWithEnclaveTasking(
    name="GPUFVSolver",
    patch_size=18,
    min_volume_h=0.0001,
    max_volume_h=0.01,
    pde_terms_without_state=True,  # This part is the relevant one here
)

Provide the additional PDE terms without state

To facilitate the offloading, we have to create alternative versions of our PDE term functions that work independently of the solver object (and in return cannot modify it). Depending on which terms we have, we'll need stateless versions of the flux, the non-conservative product and the source term. We'll always need the eigenvalue function. Per function, keep the original one and add a second one which is

  • static;
  • accepts an additional flag of type Offloadable. This is the last argument, i.e. a static version of a function has exactly the same arguments as the non-static, default variant plus one more. The last argument is solely there to distinguish the static version from the normal one, as C++ cannot overload w.r.t. static vs. non-static. Offloadable is automatically defined in the abstract base class which ExaHyPE's Python API generates.
  • is embedded into
    #if defined(OpenMPGPUOffloading)
    #pragma omp declare target
    #endif
    my new static function variants
    #if defined(OpenMPGPUOffloading)
    #pragma omp end declare target
    #endif
    You can omit the declare annotations if you use SYCL only.

Very often, the standard flux and eigenvalue routines can invoke the static variants and you can thus eliminate code redundancies. For example, you might have the normal flux function and a static flux function with the additional Offloadable parameter, but the normal function just invokes the static cousin.

In many of our codes, we take a function like

void examples::exahype2::SSInfall::SSInfall::flux( ... ) {
  logTraceInWith4Arguments( "flux(...)", faceCentre, volumeH, t, normal );
  // compute something
  logTraceOutWith4Arguments( "flux(...)", faceCentre, volumeH, t, normal );
}

and simply split it into two variants:

void examples::exahype2::SSInfall::SSInfall::flux( ... ) {
  logTraceInWith4Arguments( "flux(...)", faceCentre, volumeH, t, normal );
  flux(Q,faceCentre,volumeH,t,normal,F,Offloadable::Yes);
  logTraceOutWith4Arguments( "flux(...)", faceCentre, volumeH, t, normal );
}

void examples::exahype2::SSInfall::SSInfall::flux( ..., Offloadable ) {
  // compute something
}

The static version is used on the GPU. The normal one is used on the CPU yet defers immediately to the static version. Please note that GPUs are quite restrictive w.r.t. terminal outputs and assertions. You will have to remove both from your static routine versions. In the example above, the non-static version wraps the core functionality into assertions. Therefore, we have the assertions on the CPU, but the core part that goes to the GPU is free of assertions. The same holds for the logging.

Most ExaHyPE solvers allow you to insert the realisation of a routine directly from the Python code. In this case, both variants end up in the AbstractMySolver class: the default one as well as the stateless one.

Compile, run and check

In principle, the code is now GPU ready. Compile and run.

Tailor kernel choice

Most ExaHyPE solvers employ at least three different kernel variations: a normal one, one that vectorises aggressively on the host, and one that offloads to the GPU. If you use the default configuration, all should be set to some reasonable defaults. If this is not the case, you can manually alter

    self._fused_compute_kernel_call_cpu
    self._fused_compute_kernel_call_gpu

These values hold C++ strings, which are typically method invocations to the corresponding computing kernels. If you play around with these values, you might want to study further GPU realisation details and dive into the code.

Fuse tasks

Finally, we might want to tailor the fusion of tasks. For example, you might want to offload to the GPU if and only if there are enough tasks that you can bundle (fuse) into one meta task. Or you might want to use different GPUs depending on the task context. These decisions are made by the multithreading orchestration (among other things).

You can switch the orchestration manually in your C++ file. By default, the initialisation of ExaHyPE does this:

tarch::multicore::setOrchestration( tarch::multicore::orchestration::createDefaultStrategy() );

You can alter this one. A more elegant variant is to use exahype2.Project.set_multicore_orchestration() to let the lowering into Peano automatically insert an appropriate setOrchestration() call. The namespace tarch::multicore::orchestration holds a bundle of different orchestration strategies that you can use to tweak your code. A good starting point is

my_project.set_multicore_orchestration( "new otherOrchestration(...)" )

Advanced techniques

Select which octants (cells or patches) to ship to the GPU and back

If only some patches/cells can be offloaded to the GPU, then you can redefine the routine

virtual bool patchCanUseStatelessPDETerms(
  const tarch::la::Vector<DIMENSIONS,double>& x,
  const tarch::la::Vector<DIMENSIONS,double>& h,
  double t,
  double dt
) const;

in your solver. By default, this routine returns true. This default is written in the AbstractXXX solver. But nothing stops you from redefining the function in your particular solver subclass.

Write your own orchestration

Very advanced codes write their own orchestration which ships particular tasks specifically to the GPU. Through this, you can tailor the task execution pattern towards your specific needs. Notably, the orchestration is asked per task type whether and where to ship the task.

Further reading

Adding GPU support to your ExaHyPE source code (programmer view)

This page discusses how ExaHyPE supports SIMT and SIMD compute kernels. ExaHyPE uses GPUs in a classic offloading mode, i.e. GPUs as accelerators. This means that it takes particular tasks, ships them to the GPU and then ships the results back. This happens under the hood; data neither resides on the accelerator permanently nor is the accelerator used all the time.

The whole approach assumes that there are computations (cell/patch updates) which can be done without any effect on global variables. It works if and only if we can ship a task to the GPU, get the solution data for this task back, and no other global variable changes. We furthermore assume that the computations that fit onto the GPU have no state: they can use some global, static variables, but they cannot access the solver's state, which can change over time. We rely on code parts which have no side effects and do not depend on the solver state (minus global variables).

The same argument holds for aggressive vectorisation: Vectorisation works best if the computations on a patch do not have any side effects. Basically, ExaHyPE assumes that GPU's SIMT and CPU's SIMD are two realisations of the same pattern.

Working without side-effects might not work for all patches: There are always patches which evaluate and analyse some global data, or build up global data structures. In this case, ExaHyPE only offloads the remaining, simple patches to the GPU. That is, having a solver that supports patches/cells without side effects does not mean that all cells have to be side effect-free.

Before you dive into this technical description of how GPU offloading (and very aggressive vectorisation) work, it makes sense to read through the tutorial on GPU offloading.

Compile with GPU support

Configure ExaHyPE with the argument --with-gpu=xxx and select an appropriate GPU programming model. Rebuild the whole Peano core. This includes the ExaHyPE libraries.

Enable stateless compute kernels

The Finite Volume enclave solvers for example all support GPUs, but the versions without enclave tasking do not support them. The solver documentation should clarify which one to select. For most of these GPU-enabled solvers, it is sufficient to pass an additional flag

pde_terms_without_state=True

(for the FV variants, for example) that tells the solver that we have PDE terms which have no side effects and do not need the solver object. Consult the documentation of the respective solvers. Examples are

  • exahype2.solvers.fv.rusanov.GlobalAdaptiveTimeStepWithEnclaveTasking
  • yet to be added

Provide the additional PDE terms without state

To facilitate the offloading, we have to create alternative versions of our PDE term functions that work independently of the solver object (and in return cannot modify it). Depending on which terms we have, we'll need stateless versions of the flux, the non-conservative product and the source term. We'll always need the eigenvalue function. Per function, keep the original one and add a second one which is

  • static;
  • accepts an additional flag of type Offloadable. This is the last argument, i.e. a static version of a function has exactly the same arguments as the non-static, default variant plus one more. The last argument is solely there to distinguish the static version from the normal one, as C++ cannot overload w.r.t. static vs. non-static. Offloadable is automatically defined in the abstract base class which ExaHyPE's Python API generates.

Most ExaHyPE solvers allow you to insert the realisation of a routine directly from the Python code. In this case, the call ends up in the AbstractMySolver class. Most classes call the corresponding routine set_implementation(). If you use this one, the code generator usually does not distinguish the two callback flavours and uses the same code snippet for the normal as well as the vectorised version. However, it is likely that the routines end up in the cpp file of the abstract superclass, i.e. the compiler will not be able to inline anything. If your compiler struggles with the inlining, you might be forced either

  • to dive into linker-time optimisation - Intel, for example, did allow you to run interprocedural optimisation (ipo) on bundles of object files; or
  • to switch to a manual implementation and to move the GPU routines into the headers.

Alter compute kernel

ExaHyPE solvers which support GPUs/stateless operators typically host three different compute kernels. These kernels are plain C++ function calls, i.e. strings, on the Python level.

  • self._compute_kernel_call is a C++ function call for the normal task. This guy is called whenever you hit an octant, i.e. a Finite Volume patch or a DG cell. It calls the virtual flux, eigenvalue, ... functions on the solver object.
  • self._fused_compute_kernel_call_cpu is a C++ function call that ExaHyPE uses whenever it encounters a set of octants in the tree which can be processed embarrassingly parallel as they all invoke the same PDE and none of them alters the state of the underlying solver object. So the compute kernels identified by this string can call the static flux and eigenvalue instead of the virtual function.
  • self._fused_compute_kernel_call_gpu is the equivalent to _fused_compute_kernel_call_cpu which the code uses when it decides to offload a bunch of cells to an accelerator.

By default, each of these strings points to a generic kernel call which automatically picks a working default backend depending on your configuration.

While this gives you a working version quickly, you might want to switch to a tailored variant for your particular solver. This can be done by changing the actual solver kernel:

mysolver._fused_compute_kernel_call_gpu = exahype2.kerneldsl.api.compile( .... )

Mask out cells which are not a fit

If only some patches/cells can be offloaded to the GPU, then you can redefine the routine

virtual bool patchCanUseStatelessPDETerms(
  const tarch::la::Vector<DIMENSIONS,double>& patchCentre,
  const tarch::la::Vector<DIMENSIONS,double>& patchH,
  double t,
  double dt
) const;

in your solver. By default, this routine always returns true. Here's the clue: this is a normal member function, i.e. you can use the solver's state and make the result depend on it.

patchCanUseStatelessPDETerms() yields, by default, true, so all compute kernels end up on the GPU if they are embedded into enclave tasks. You might want to alter this and keep some kernels on the host.

Sanity checks

The sections below discuss what you should see in the code.

Introduce aggressively vectorised kernel

Once you have a GPU-ready solver, it makes sense to double-check quickly if the solver really dispatches the correct compute kernels. For this, we first study the generated task. Stateless routines are also very useful on the CPU, as we can aggressively vectorise over them. Therefore, the task should look similar to

double benchmarks::exahype2::ccz4::tasks::CCZ4FVEnclaveTask::applyKernelToCell(...) {
  ...
  if (repositories::instanceOfCCZ4FV.patchCanUseStatelessPDETerms(
    marker.getX(),
    marker.getH(),
    t,
    dt
  )) {
    ::exahype2::fv::rusanov::timeStepWithRusanovPatchwiseHeapStateless<
      ...
    >(patchData);
  } else {
    ::exahype2::fv::rusanov::timeStepWithRusanovPatchwiseHeapFunctors<
      ...
    >(patchData,
      ...
    );
  }
}

The kernels might look completely different, but the principle is clear: whenever we hit a cell, we check if patchCanUseStatelessPDETerms() holds. If it does not, we use the default version. If it holds, we invoke the specialised version which assumes that there is a variant with static (stateless) calls.

Fused compute kernels

Scroll further down until you see the fuse() routine. This one is again very simple to digest:

bool benchmarks::exahype2::ccz4::tasks::CCZ4FVEnclaveTask::fuse(const std::list<Task*>& otherTasks, int targetDevice) {
  ...
  #if defined(WITH_GPU_SYCL)
  if (targetDevice >= 0 or targetDevice == Host) {
    foundOffloadingBranch = true;
    ::exahype2::fv::rusanov::sycl::timeStepWithRusanovPatchwiseHeapStateless<
      ...
    >(targetDevice, patchData);
  }
  #endif

We see here that SYCL can run on both the host and the device, whereas the OpenMP version is picked if and only if we have a valid device.

If this call logic makes no sense, most solvers have attributes such as

    self._fused_compute_kernel_call_cpu
    self._fused_compute_kernel_call_gpu

to switch to the modified kernel realisation variant. Typically, there are factory mechanisms to alter those guys:

my_solver._fused_compute_kernel_call_gpu = exahype2.solvers.fv.rusanov.kernels.create_compute_Riemann_kernel_for_Rusanov(
    my_solver._flux_implementation,
    my_solver._ncp_implementation,
    my_solver._source_term_implementation,
    compute_max_eigenvalue_of_next_time_step=True,
    solver_variant=exahype2.solvers.fv.rusanov.kernels.SolverVariant.Accelerator,
    kernel_variant=exahype2.solvers.fv.rusanov.kernels.KernelVariant.GenericAoS,
)