Peano
benchmarks::exahype2::ccz4::MulticoreOrchestration Class Reference

Hard-coded strategy for the single black hole setup.

#include <MulticoreOrchestration.h>

Inheritance diagram for benchmarks::exahype2::ccz4::MulticoreOrchestration:
Collaboration diagram for benchmarks::exahype2::ccz4::MulticoreOrchestration:

Public Member Functions

 MulticoreOrchestration ()
 
virtual ~MulticoreOrchestration ()=default
 
virtual void startBSPSection (int nestedParallelismLevel) override
 Start a fork/join section.
 
virtual void endBSPSection (int nestedParallelismLevel) override
 End a fork/join section.
 
virtual int getNumberOfTasksToHoldBack (int taskType) override
 How many tasks should be held back.
 
virtual FuseInstruction getNumberOfTasksToFuseAndTargetDevice (int taskType) override
 Ensure the right cardinality ends up on the GPU.
 
virtual bool fuseTasksImmediatelyWhenSpawned (int taskType) override
 Ensure Finite Volume tasks end up on the GPU asap.
 
virtual ExecutionPolicy paralleliseForkJoinSection (int nestedParallelismLevel, int numberOfTasks, int taskType) override
 Determine how to parallelise a fork/join section.

Private Attributes

int _nestedBSPLevels
 Number of nested fork/join levels.
 
int _maxFiniteVolumeTasks
 Maximum number of finite volume tasks in the system.
 
int _finiteVolumeTasksInThisBSPSection
 Number of finite volume tasks spawned so far in the current BSP section.
 

Detailed Description

Hard-coded strategy for the single black hole setup.

The single black hole setup is a fixed setup in which a large domain is covered by higher-order patches, plus a small area in the centre which is covered by Finite Volumes. The latter are very compute-heavy and hence quickly become the bottleneck, so we have to process them as soon as possible. We could realise this by giving those tasks a higher priority than all other tasks, but I prefer to realise all scheduling within this orchestration object. Certainly, priorities might do a fine job as well.

Behaviour

  • The mesh can be rather complicated with AMR, and the inclusion of Finite Volume outcomes hence might be expensive. So we want to have all enclave tasks computed by the end of the first sweep if possible.
  • Each FV patch makes an object well-suited for a GPU. If there is an accelerator, we offload each FV patch immediately, i.e. fuse it into a GPU task as soon as it is spawned.
  • If there is no accelerator, we map each FV patch immediately onto a proper task.
  • All other tasks are held back and only mapped onto proper tasks once the producing BSP section has terminated or once we know that all the Finite Volume tasks are out.

Realisation

I hijack getNumberOfTasksToHoldBack() to keep track of the total number of enclave tasks. This count is a constant for this setup, so I can derive it via a max function and know that the correct value is available after the first grid sweep.

The "magic" happens in getNumberOfTasksToHoldBack(), and the documentation of this routine provides some further details.

We disable any nested parallelism. See paralleliseForkJoinSection()'s documentation.
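
Putting the pieces together, the class boils down to a small strategy object. The following is a condensed sketch of the header as one might reconstruct it from this page; the base class tarch::multicore::orchestration::Strategy is inferred from the qualified return type of paralleliseForkJoinSection() and is an assumption here:

    namespace benchmarks { namespace exahype2 { namespace ccz4 {
      class MulticoreOrchestration: public tarch::multicore::orchestration::Strategy {
        private:
          int _nestedBSPLevels;                   // nesting depth of currently open fork/join sections
          int _maxFiniteVolumeTasks;              // maximum number of FV tasks observed in one sweep
          int _finiteVolumeTasksInThisBSPSection; // FV tasks spawned so far in the current section
        public:
          MulticoreOrchestration();
          virtual ~MulticoreOrchestration() = default;
          virtual void startBSPSection(int nestedParallelismLevel) override;
          virtual void endBSPSection(int nestedParallelismLevel) override;
          virtual int getNumberOfTasksToHoldBack(int taskType) override;
          virtual FuseInstruction getNumberOfTasksToFuseAndTargetDevice(int taskType) override;
          virtual bool fuseTasksImmediatelyWhenSpawned(int taskType) override;
          virtual ExecutionPolicy paralleliseForkJoinSection(int nestedParallelismLevel, int numberOfTasks, int taskType) override;
      };
    }}}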

Definition at line 55 of file MulticoreOrchestration.h.

Constructor & Destructor Documentation

◆ MulticoreOrchestration()

benchmarks::exahype2::ccz4::MulticoreOrchestration::MulticoreOrchestration ( )

Definition at line 18 of file MulticoreOrchestration.cpp.

◆ ~MulticoreOrchestration()

virtual benchmarks::exahype2::ccz4::MulticoreOrchestration::~MulticoreOrchestration ( )
virtual default

Member Function Documentation

◆ endBSPSection()

void benchmarks::exahype2::ccz4::MulticoreOrchestration::endBSPSection ( int  nestedParallelismLevel)
override virtual

End a fork/join section.

Decrement the counter _nestedBSPLevels. If the outermost parallel region joins, we can update _maxFiniteVolumeTasks.

See also
paralleliseForkJoinSection() which uses the nested parallelism counter.
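
A minimal sketch of this join logic, assuming the counter semantics described above (the actual bookkeeping in MulticoreOrchestration.cpp may differ in detail):

    void benchmarks::exahype2::ccz4::MulticoreOrchestration::endBSPSection(int nestedParallelismLevel) {
      _nestedBSPLevels--;
      if (_nestedBSPLevels == 0) {
        // Outermost region has joined, i.e. all FV tasks of this sweep have
        // been spawned. Remember the largest count seen so far (std::max
        // from <algorithm>).
        _maxFiniteVolumeTasks = std::max(_maxFiniteVolumeTasks, _finiteVolumeTasksInThisBSPSection);
      }
    }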

Definition at line 32 of file MulticoreOrchestration.cpp.

◆ fuseTasksImmediatelyWhenSpawned()

bool benchmarks::exahype2::ccz4::MulticoreOrchestration::fuseTasksImmediatelyWhenSpawned ( int  taskType)
override virtual

Ensure Finite Volume tasks end up on the GPU asap.

Always returns true, as we want to get the Finite Volume tasks to the accelerator as soon as possible. All other tasks may remain on the CPU or not; here, it makes no big difference.
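
Given that description, the routine presumably degenerates to a constant; a sketch:

    bool benchmarks::exahype2::ccz4::MulticoreOrchestration::fuseTasksImmediatelyWhenSpawned(int taskType) {
      // FV tasks should reach the accelerator asap; for all other task types
      // eager fusion makes no big difference either way.
      return true;
    }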

Definition at line 75 of file MulticoreOrchestration.cpp.

◆ getNumberOfTasksToFuseAndTargetDevice()

benchmarks::exahype2::ccz4::MulticoreOrchestration::FuseInstruction benchmarks::exahype2::ccz4::MulticoreOrchestration::getNumberOfTasksToFuseAndTargetDevice ( int  taskType)
override virtual

Ensure the right cardinality ends up on the GPU.

If we have a Finite Volume task, we send it off to the GPU immediately. If we have FD4 tasks, we wait until we have 16 of them and then send them off together. The value 16 is arbitrary; I simply needed some number.
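
A sketch of this rule. How a FuseInstruction is constructed and how an FV task type is recognised are not documented on this page, so isFiniteVolumeTaskType() and the FuseInstruction arguments below are hypothetical placeholders:

    benchmarks::exahype2::ccz4::MulticoreOrchestration::FuseInstruction
    benchmarks::exahype2::ccz4::MulticoreOrchestration::getNumberOfTasksToFuseAndTargetDevice(int taskType) {
      if (isFiniteVolumeTaskType(taskType)) {     // hypothetical predicate
        // Ship each Finite Volume task to the GPU on its own, right away.
        return FuseInstruction(/*device=*/0, /*maxTasks=*/1);
      }
      else {
        // FD4 tasks are batched; the 16 is an arbitrary choice.
        return FuseInstruction(/*device=*/0, /*maxTasks=*/16);
      }
    }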

Definition at line 58 of file MulticoreOrchestration.cpp.

◆ getNumberOfTasksToHoldBack()

int benchmarks::exahype2::ccz4::MulticoreOrchestration::getNumberOfTasksToHoldBack ( int  taskType)
override virtual

How many tasks should be held back.

If we have a GPU and are given an FV task, we hold it back, as we know that fuseTasksImmediatelyWhenSpawned(int taskType) yields true and the task is hence offloaded straightaway. If there is no GPU, we map FV tasks onto proper tasks immediately.

This routine is basically where the magic happens and where the logic from the class description is realised.
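
A sketch of a hold-back rule consistent with the description above. The device query weHaveAnAccelerator() and the FV-type predicate are hypothetical stand-ins, and the return value is read here as "how many tasks of this type to keep buffered":

    int benchmarks::exahype2::ccz4::MulticoreOrchestration::getNumberOfTasksToHoldBack(int taskType) {
      if (isFiniteVolumeTaskType(taskType)) {     // hypothetical predicate
        _finiteVolumeTasksInThisBSPSection++;     // routine hijacked as a task counter
        // With a GPU, the FV task is fused and offloaded immediately anyway,
        // so we may buffer it; without a GPU it becomes a proper task at once.
        return weHaveAnAccelerator() ? 1 : 0;     // hypothetical device query
      }
      // All other tasks wait until the producing BSP section has joined
      // (std::numeric_limits from <limits>).
      return _nestedBSPLevels > 0 ? std::numeric_limits<int>::max() : 0;
    }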

Definition at line 39 of file MulticoreOrchestration.cpp.

◆ paralleliseForkJoinSection()

tarch::multicore::orchestration::Strategy::ExecutionPolicy benchmarks::exahype2::ccz4::MulticoreOrchestration::paralleliseForkJoinSection ( int  nestedParallelismLevel,
int  numberOfTasks,
int  taskType 
)
override virtual

Determine how to parallelise a fork/join section.

I found nested parallelism to be brutally slow, so I always return tarch::multicore::orchestration::Strategy::ExecutionPolicy::RunSerially whenever more than one parallel level is nested within another. Otherwise, I am happy for a section to be processed in parallel.
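
A sketch of this policy. Only the RunSerially literal is confirmed by this page, so the name of the parallel policy below is an assumption:

    tarch::multicore::orchestration::Strategy::ExecutionPolicy
    benchmarks::exahype2::ccz4::MulticoreOrchestration::paralleliseForkJoinSection(
      int nestedParallelismLevel, int numberOfTasks, int taskType
    ) {
      // Nested parallelism proved to be brutally slow: only the outermost
      // level may actually fork.
      return nestedParallelismLevel > 1
        ? tarch::multicore::orchestration::Strategy::ExecutionPolicy::RunSerially
        : tarch::multicore::orchestration::Strategy::ExecutionPolicy::RunParallel;  // assumed literal
    }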

Definition at line 79 of file MulticoreOrchestration.cpp.

◆ startBSPSection()

void benchmarks::exahype2::ccz4::MulticoreOrchestration::startBSPSection ( int  nestedParallelismLevel)
override virtual

Start a fork/join section.

Reset _finiteVolumeTasksInThisBSPSection if this is the start of the outermost parallel region.

See also
paralleliseForkJoinSection() which uses the nested parallelism counter.
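
A minimal sketch of the reset logic; whether the nesting counter is incremented or simply copied from the argument is an implementation detail guessed here:

    void benchmarks::exahype2::ccz4::MulticoreOrchestration::startBSPSection(int nestedParallelismLevel) {
      _nestedBSPLevels++;
      if (_nestedBSPLevels == 1) {
        // Outermost parallel region, i.e. a new sweep: restart the FV counter.
        _finiteVolumeTasksInThisBSPSection = 0;
      }
    }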

Definition at line 24 of file MulticoreOrchestration.cpp.

Field Documentation

◆ _finiteVolumeTasksInThisBSPSection

int benchmarks::exahype2::ccz4::MulticoreOrchestration::_finiteVolumeTasksInThisBSPSection
private

Number of finite volume tasks spawned so far in the current BSP section.

Definition at line 75 of file MulticoreOrchestration.h.

◆ _maxFiniteVolumeTasks

int benchmarks::exahype2::ccz4::MulticoreOrchestration::_maxFiniteVolumeTasks
private

Maximum number of finite volume tasks in the system.

I don't use this value at the moment, but I might want to use it for GPUs later on.

See also
endBSPSection(), which updates this value when the outermost region joins.

Definition at line 70 of file MulticoreOrchestration.h.

◆ _nestedBSPLevels

int benchmarks::exahype2::ccz4::MulticoreOrchestration::_nestedBSPLevels
private

Number of nested fork/join levels.

Important for paralleliseForkJoinSection() to decide if the parallel region should actually be processed concurrently.

Definition at line 62 of file MulticoreOrchestration.h.


The documentation for this class was generated from the following files:
  • MulticoreOrchestration.h
  • MulticoreOrchestration.cpp