Peano
benchmarks::exahype2::ccz4::MulticoreOrchestration Class Reference

Hard-coded strategy for the single black hole setup.

#include <MulticoreOrchestration.h>

Inheritance diagram for benchmarks::exahype2::ccz4::MulticoreOrchestration:
Collaboration diagram for benchmarks::exahype2::ccz4::MulticoreOrchestration:

Public Member Functions

 MulticoreOrchestration ()
 
virtual ~MulticoreOrchestration ()=default
 
virtual void startBSPSection (int nestedParallelismLevel) override
 Start a fork/join section.
 
virtual void endBSPSection (int nestedParallelismLevel) override
 End a fork/join section.
 
virtual int getNumberOfTasksToHoldBack (int taskType) override
 How many tasks should be held back.
 
virtual FuseInstruction getNumberOfTasksToFuseAndTargetDevice (int taskType) override
 Ensure the right cardinality ends up on the GPU.
 
virtual bool fuseTasksImmediatelyWhenSpawned (int taskType) override
 Ensure Finite Volume tasks end up on the GPU asap.
 
virtual ExecutionPolicy paralleliseForkJoinSection (int nestedParallelismLevel, int numberOfTasks, int taskType) override
 Determine how to parallelise a fork/join section.

Private Attributes

int _nestedBSPLevels
 Number of nested fork/join levels.
 
int _maxFiniteVolumeTasks
 Maximum number of finite volume tasks in the system.
 
int _finiteVolumeTasksInThisBSPSection
 Number of finite volume tasks spawned so far in the current BSP section.
 

Detailed Description

Hard-coded strategy for the single black hole setup.

The single black hole setup is a fixed setup in which a large domain is covered by higher-order patches, plus a small area in the centre which is covered by Finite Volumes. The latter are very compute-heavy and hence quickly become the bottleneck, so we have to process them as soon as possible. We could realise this by giving those tasks a higher priority than all other tasks, but I prefer to realise all scheduling within this orchestration object. Certainly, priorities might do a fine job as well.

Behaviour

  • The mesh can be rather complicated with AMR, and the inclusion of Finite Volume outcomes hence might be expensive. So we want to have all enclave tasks computed by the end of the first sweep if possible.
  • Each FV patch makes an object well-suited for a GPU. If there is an accelerator, we offload each FV patch immediately, i.e. fuse it into a GPU task as soon as it is spawned.
  • If there is no accelerator, we map each FV patch immediately onto a proper task.
  • All other tasks are held back and only mapped onto proper tasks once the producing BSP section has terminated or once we know that all the Finite Volume tasks are out.

Realisation

I hijack getNumberOfTasksToHoldBack() to keep track of the total number of enclave tasks. This count is a constant for this setup, so I can derive it via a max function and know that the correct value is available after the first grid sweep.

The "magic" happens in getNumberOfTasksToHoldBack(), and the documentation of this routine provides some further details.

We disable any nested parallelism. See paralleliseForkJoinSection()'s documentation.
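
Putting the pieces together, the class boils down to a small strategy object. The following is a condensed sketch of the header as one might reconstruct it from this page; the base class tarch::multicore::orchestration::Strategy is inferred from the qualified return type of paralleliseForkJoinSection() and is an assumption here:

    namespace benchmarks { namespace exahype2 { namespace ccz4 {
      class MulticoreOrchestration: public tarch::multicore::orchestration::Strategy {
        private:
          int _nestedBSPLevels;                   // nesting depth of currently open fork/join sections
          int _maxFiniteVolumeTasks;              // maximum number of FV tasks observed in one sweep
          int _finiteVolumeTasksInThisBSPSection; // FV tasks spawned so far in the current section
        public:
          MulticoreOrchestration();
          virtual ~MulticoreOrchestration() = default;
          virtual void startBSPSection(int nestedParallelismLevel) override;
          virtual void endBSPSection(int nestedParallelismLevel) override;
          virtual int getNumberOfTasksToHoldBack(int taskType) override;
          virtual FuseInstruction getNumberOfTasksToFuseAndTargetDevice(int taskType) override;
          virtual bool fuseTasksImmediatelyWhenSpawned(int taskType) override;
          virtual ExecutionPolicy paralleliseForkJoinSection(int nestedParallelismLevel, int numberOfTasks, int taskType) override;
      };
    }}}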

Definition at line 55 of file MulticoreOrchestration.h.

Constructor & Destructor Documentation

◆ MulticoreOrchestration()

benchmarks::exahype2::ccz4::MulticoreOrchestration::MulticoreOrchestration ( )

Definition at line 18 of file MulticoreOrchestration.cpp.

◆ ~MulticoreOrchestration()

virtual benchmarks::exahype2::ccz4::MulticoreOrchestration::~MulticoreOrchestration ( )
virtual default

Member Function Documentation

◆ endBSPSection()

void benchmarks::exahype2::ccz4::MulticoreOrchestration::endBSPSection ( int  nestedParallelismLevel)
override virtual

End a fork/join section.

Decrement the counter _nestedBSPLevels. If the outermost parallel region joins, we can update _maxFiniteVolumeTasks.

See also
paralleliseForkJoinSection() which uses the nested parallelism counter.
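
A minimal sketch of this join logic, assuming the counter semantics described above (the actual bookkeeping in MulticoreOrchestration.cpp may differ in detail):

    void benchmarks::exahype2::ccz4::MulticoreOrchestration::endBSPSection(int nestedParallelismLevel) {
      _nestedBSPLevels--;
      if (_nestedBSPLevels == 0) {
        // Outermost region has joined, i.e. all FV tasks of this sweep have
        // been spawned. Remember the largest count seen so far (std::max
        // from <algorithm>).
        _maxFiniteVolumeTasks = std::max(_maxFiniteVolumeTasks, _finiteVolumeTasksInThisBSPSection);
      }
    }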

Definition at line 32 of file MulticoreOrchestration.cpp.

◆ fuseTasksImmediatelyWhenSpawned()

bool benchmarks::exahype2::ccz4::MulticoreOrchestration::fuseTasksImmediatelyWhenSpawned ( int  taskType)
override virtual

Ensure Finite Volume tasks end up on the GPU asap.

Always returns true, as we want to get the Finite Volume tasks to the accelerator as soon as possible. All other tasks may remain on the CPU or not; here, it makes no big difference.
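
Given that description, the routine presumably degenerates to a constant; a sketch:

    bool benchmarks::exahype2::ccz4::MulticoreOrchestration::fuseTasksImmediatelyWhenSpawned(int taskType) {
      // FV tasks should reach the accelerator asap; for all other task types
      // eager fusion makes no big difference either way.
      return true;
    }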

Definition at line 75 of file MulticoreOrchestration.cpp.

◆ getNumberOfTasksToFuseAndTargetDevice()

benchmarks::exahype2::ccz4::MulticoreOrchestration::FuseInstruction benchmarks::exahype2::ccz4::MulticoreOrchestration::getNumberOfTasksToFuseAndTargetDevice ( int  taskType)
override virtual

Ensure the right cardinality ends up on the GPU.

If we have a Finite Volume task, we send it off to the GPU immediately. If we have FD4 tasks, we wait until we have 16 of them and then send them off together. The value 16 is arbitrary; I simply needed some number.
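
A sketch of this rule. How a FuseInstruction is constructed and how an FV task type is recognised are not documented on this page, so isFiniteVolumeTaskType() and the FuseInstruction arguments below are hypothetical placeholders:

    benchmarks::exahype2::ccz4::MulticoreOrchestration::FuseInstruction
    benchmarks::exahype2::ccz4::MulticoreOrchestration::getNumberOfTasksToFuseAndTargetDevice(int taskType) {
      if (isFiniteVolumeTaskType(taskType)) {     // hypothetical predicate
        // Ship each Finite Volume task to the GPU on its own, right away.
        return FuseInstruction(/*device=*/0, /*maxTasks=*/1);
      }
      else {
        // FD4 tasks are batched; the 16 is an arbitrary choice.
        return FuseInstruction(/*device=*/0, /*maxTasks=*/16);
      }
    }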

Definition at line 58 of file MulticoreOrchestration.cpp.

◆ getNumberOfTasksToHoldBack()

int benchmarks::exahype2::ccz4::MulticoreOrchestration::getNumberOfTasksToHoldBack ( int  taskType)
override virtual

How many tasks should be held back.

If we have a GPU and are given an FV task, we hold it back, as we know that fuseTasksImmediatelyWhenSpawned(int taskType) yields true and the task is hence offloaded straightaway. If there is no GPU, we map FV tasks onto proper tasks immediately.

This routine is basically where the magic happens and where the logic from the class description is realised.
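
A sketch of a hold-back rule consistent with the description above. The device query weHaveAnAccelerator() and the FV-type predicate are hypothetical stand-ins, and the return value is read here as "how many tasks of this type to keep buffered":

    int benchmarks::exahype2::ccz4::MulticoreOrchestration::getNumberOfTasksToHoldBack(int taskType) {
      if (isFiniteVolumeTaskType(taskType)) {     // hypothetical predicate
        _finiteVolumeTasksInThisBSPSection++;     // routine hijacked as a task counter
        // With a GPU, the FV task is fused and offloaded immediately anyway,
        // so we may buffer it; without a GPU it becomes a proper task at once.
        return weHaveAnAccelerator() ? 1 : 0;     // hypothetical device query
      }
      // All other tasks wait until the producing BSP section has joined
      // (std::numeric_limits from <limits>).
      return _nestedBSPLevels > 0 ? std::numeric_limits<int>::max() : 0;
    }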

Definition at line 39 of file MulticoreOrchestration.cpp.

◆ paralleliseForkJoinSection()

tarch::multicore::orchestration::Strategy::ExecutionPolicy benchmarks::exahype2::ccz4::MulticoreOrchestration::paralleliseForkJoinSection ( int  nestedParallelismLevel,
int  numberOfTasks,
int  taskType 
)
override virtual

Determine how to parallelise a fork/join section.

I found nested parallelism to be brutally slow, so I always return tarch::multicore::orchestration::Strategy::ExecutionPolicy::RunSerially whenever more than one parallel level is nested within another. Otherwise, I am happy for a section to be processed in parallel.
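
A sketch of this policy. Only the RunSerially literal is confirmed by this page, so the name of the parallel policy below is an assumption:

    tarch::multicore::orchestration::Strategy::ExecutionPolicy
    benchmarks::exahype2::ccz4::MulticoreOrchestration::paralleliseForkJoinSection(
      int nestedParallelismLevel, int numberOfTasks, int taskType
    ) {
      // Nested parallelism proved to be brutally slow: only the outermost
      // level may actually fork.
      return nestedParallelismLevel > 1
        ? tarch::multicore::orchestration::Strategy::ExecutionPolicy::RunSerially
        : tarch::multicore::orchestration::Strategy::ExecutionPolicy::RunParallel;  // assumed literal
    }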

Definition at line 79 of file MulticoreOrchestration.cpp.

◆ startBSPSection()

void benchmarks::exahype2::ccz4::MulticoreOrchestration::startBSPSection ( int  nestedParallelismLevel)
override virtual

Start a fork/join section.

Reset _finiteVolumeTasksInThisBSPSection if this is the start of the outermost parallel region.

See also
paralleliseForkJoinSection() which uses the nested parallelism counter.
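
A minimal sketch of the reset logic; whether the nesting counter is incremented or simply copied from the argument is an implementation detail guessed here:

    void benchmarks::exahype2::ccz4::MulticoreOrchestration::startBSPSection(int nestedParallelismLevel) {
      _nestedBSPLevels++;
      if (_nestedBSPLevels == 1) {
        // Outermost parallel region, i.e. a new sweep: restart the FV counter.
        _finiteVolumeTasksInThisBSPSection = 0;
      }
    }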

Definition at line 24 of file MulticoreOrchestration.cpp.

Field Documentation

◆ _finiteVolumeTasksInThisBSPSection

int benchmarks::exahype2::ccz4::MulticoreOrchestration::_finiteVolumeTasksInThisBSPSection
private

Number of finite volume tasks spawned so far in the current BSP section.

Definition at line 75 of file MulticoreOrchestration.h.

◆ _maxFiniteVolumeTasks

int benchmarks::exahype2::ccz4::MulticoreOrchestration::_maxFiniteVolumeTasks
private

Maximum number of finite volume tasks in the system.

I don't use this value at the moment, but I might want to use it for GPUs later on.

See also
endBSPSection(), which updates this value when the outermost region joins.

Definition at line 70 of file MulticoreOrchestration.h.

◆ _nestedBSPLevels

int benchmarks::exahype2::ccz4::MulticoreOrchestration::_nestedBSPLevels
private

Number of nested fork/join levels.

Important for paralleliseForkJoinSection() to decide if the parallel region should actually be processed concurrently.

Definition at line 62 of file MulticoreOrchestration.h.


The documentation for this class was generated from the following files:
  • MulticoreOrchestration.h
  • MulticoreOrchestration.cpp