Peano
Multicore programming

To compile with multicore support, you have to invoke the configure script with the option

--with-multithreading=value

where value is

  • cpp. This adds support through C++14 threads.
  • tbb. This adds support through Intel's Threading Building Blocks (TBB). If you use this option, you first have to ensure that your CXXFLAGS and LDFLAGS point to the right include and library directories, respectively. LDFLAGS also has to comprise either -ltbb or -ltbb_debug.
  • tbb_extension. This is an alternative TBB backend that does not require the very latest TBB extensions.
  • openmp. This adds OpenMP support. We currently develop against OpenMP 4.x though some of our routines use OpenMP target and thus are developed against OpenMP 5.
  • sycl. We offer SYCL support for multithreading, though, within the Intel toolchain, it might be more appropriate to combine SYCL on the GPU with the tbb backend for multithreading.
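As a concrete illustration, a TBB build might be configured along the following lines. The paths below are placeholders for your local installation, not fixed defaults:

```shell
# Hypothetical installation paths; adapt CXXFLAGS/LDFLAGS to your system.
export CXXFLAGS="$CXXFLAGS -I/opt/intel/tbb/include"
export LDFLAGS="$LDFLAGS -L/opt/intel/tbb/lib -ltbb"
./configure --with-multithreading=tbb
```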

Our vision is that each code should be totally independent of the multithreading implementation chosen. Indeed, Peano 4 itself does not contain any direct multithreading library calls. It solely relies on the classes and functions from tarch::multicore.



As of today (2025-03-29), OpenMP's tasking implementation is not 100% stable. As long as you don't use task dependencies excessively, you should be fine, but if you use complex task graphs, you might run into occasional seg faults in the OpenMP runtime.
As of today (2025-03-29), TBB's dynamic task graphs are only available as preview in UXL's internal repositories. So you either have to install those extensions, or you pick tbb_extension.

Administering the multithreaded environment

The central instance managing the threads on a system is tarch::multicore::Core. This is a singleton, and its name is thus slightly misleading: it does not represent one core but rather the landscape of cores. You can set up the multithreading environment through Core's configure() routine, but this is optional. Indeed, multithreading should work without calling configure() at all. Each multithreading backend offers its own realisation of the Core class.

Avoid race conditions

For multithreaded code, it is important that the code can lock (protect) code regions and free them. For this, the multithreading layer offers different semaphores. Each multithreading backend maps these logical concepts onto its internal synchronisation mechanism. Usually, I use the semaphores through lock objects. As they rely on the semaphore implementations, they are generic and work for any backend.

So the standard use is that you have

static tarch::multicore::BooleanSemaphore myFancySemaphore;

and then you use it by locking it:

{ // scope starts
tarch::multicore::Lock lock(myFancySemaphore);
// [all of this is protected]
} // end of scope destroys lock object and therefore frees semaphore

Consult the class documentation of tarch::multicore::BooleanSemaphore for a more detailed discussion of the thread locks.

The plain semaphore does not work recursively, i.e. if a thread locks a semaphore and then locks it again, it deadlocks itself. You can avoid this by switching to a recursive semaphore. Recursive semaphores are more expensive yet allow a thread to lock the same semaphore repeatedly.
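The distinction can be illustrated with the C++ standard library analogue: a recursive mutex plays the role of tarch's recursive semaphore. This is a conceptual sketch, not the tarch implementation:

```cpp
#include <mutex>

// With a plain mutex, the nested lock below would deadlock the thread.
// A recursive mutex instead counts its acquisitions per thread.
std::recursive_mutex recursiveMutex;

int nestedCriticalSections() {
  std::lock_guard<std::recursive_mutex> outer(recursiveMutex);
  // Re-entering from the same thread is safe with a recursive mutex.
  std::lock_guard<std::recursive_mutex> inner(recursiveMutex);
  return 2;  // both lock levels are held simultaneously here
}
```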

Each semaphore also supports a try_lock() operation, so you can do busy polling. try_lock() is not supported by the tarch::multicore::Lock object: you have to access the semaphore directly and free it manually later, whereas the Lock object automatically frees the semaphore in its destructor (unless you have done so manually before).
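In standard C++ terms, the pattern resembles std::mutex::try_lock() followed by a manual unlock. Again, this is a sketch of the idea rather than tarch code:

```cpp
#include <mutex>

std::mutex pollMutex;

// Busy polling: attempt to acquire without blocking. The caller must
// unlock manually, mirroring direct semaphore access without a Lock object.
bool tryDoWork() {
  if (pollMutex.try_lock()) {
    // ... protected work would go here ...
    pollMutex.unlock();  // manual release; no RAII lock object involved
    return true;
  }
  return false;  // someone else holds the lock; caller may retry later
}
```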

Atomics

At the moment, the tarch does not provide a backend-independent abstraction of atomics. We hope that such a thing would be obsolete anyway, as C++'s std::atomic works with all backends. If this is not the case and if you need atomics, you have to write backend-specific variants protected by preprocessor macros (see below).
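Since std::atomic is the recommended route, a shared counter incremented from many threads needs no semaphore at all:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// std::atomic works with all backends: many threads increment one
// counter concurrently without any explicit locking.
int countAtomically(int numThreads, int incrementsPerThread) {
  std::atomic<int> counter{0};
  std::vector<std::thread> threads;
  for (int t = 0; t < numThreads; t++) {
    threads.emplace_back([&counter, incrementsPerThread]() {
      for (int i = 0; i < incrementsPerThread; i++) {
        counter.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (auto& thread : threads) thread.join();
  return counter.load();
}
```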

Special semaphores

If you have a piece of code that locks a semaphore recursively, the second call to lock will deadlock. In this case, you have to use tarch::multicore::RecursiveLock.

Sometimes, it is fine for many threads to access a code block concurrently as they only read data. At the same time, if one thread writes some data, no other thread should read or write concurrently. In this case, you have to use tarch::multicore::MultiReadSingleWriteSemaphore.

Write bespoke multithreaded variants

The namespace tarch::multicore provides a set of preprocessor macros and further information on how to write bespoke code for applications that have to distinguish a single-threaded realisation from multithreaded variants.

Tasks

All tasking is modelled through the class tarch::multicore::Task. That is, Peano expects each task to be a subclass of this class. However, there is already a specialisation of the class accepting a functor if you prefer to work with lambdas.

The actual task submission, dependency tracking, and synchronisation are all realised through functions within the namespace tarch::multicore. Consult the function tarch::multicore::spawnTask().

Task types

Peano models all of its internals as tasks. Each Peano 4 task is a subclass of tarch::multicore::Task. However, these classes might not be mapped 1:1 onto native tasks. In line with other APIs such as oneTBB, we distinguish different task types or task graph types, respectively:

  • Tasks. The most generic type of task is spawned via tarch::multicore::spawnTask(). Each task can be assigned a unique number and incoming dependencies; the number in return can be used to specify outgoing dependencies.
  • Fork-join tasks. These are created via tarch::multicore::spawnAndWait() which accepts a sequence of tasks. They are all run in parallel but then wait for each other, i.e. we define a tree (sub)task graph. With fork-join calls, we mirror the principles behind bulk-synchronous programming (BSP).
  • Fusable tasks. A subtype of the normal tasks.
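The fork-join pattern behind spawnAndWait() can be sketched with plain std::thread: all tasks of the set run in parallel, and execution only continues once every one of them has completed. The doubling work inside each task is a stand-in for real task bodies:

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Fork-join in plain C++: spawn one worker per task, then wait for
// all of them before continuing, as spawnAndWait() does for a task set.
int forkJoinSum(const std::vector<int>& perTaskInput) {
  std::vector<int> results(perTaskInput.size(), 0);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < perTaskInput.size(); i++) {
    workers.emplace_back([&results, &perTaskInput, i]() {
      results[i] = perTaskInput[i] * 2;  // stand-in for real task work
    });
  }
  for (auto& worker : workers) worker.join();  // the "join" barrier
  return std::accumulate(results.begin(), results.end(), 0);
}
```

This mirrors one BSP superstep: a compute phase followed by a global synchronisation point.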

Tasks with dependencies

In Peano, task DAGs are built up along the task workflow. That is, each task that is not used within a fork-join region, i.e. each totally free task, is assigned a unique number when we spawn it.

Whenever we define a task, we can also define its dependencies. These are pure completion dependencies: you tell the task system which tasks have to be completed before the currently submitted one is allowed to start. A DAG thus can be built up layer by layer. We start with the first task. This task might be immediately executed - we do not care - and then we continue to work our way down through the graph, adding node by node.

In line with OpenMP and TBB - where we significantly influenced the development of the dynamic task API - outgoing dependencies should be declared before we use them.
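A completion dependency can be mimicked with std::future: the successor task blocks on the predecessor's future before it starts. This is a conceptual sketch only; tarch expresses dependencies through task numbers rather than futures:

```cpp
#include <future>

// Task A produces a value; task B declares A as its completion
// dependency by waiting on A's future before doing its own work.
int runTwoTaskDag() {
  std::future<int> taskA = std::async(std::launch::async, []() {
    return 21;  // predecessor task
  });
  std::future<int> taskB = std::async(std::launch::async, [&taskA]() {
    int a = taskA.get();  // completion dependency: wait for A to finish
    return a * 2;         // successor task's own work
  });
  return taskB.get();
}
```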

Task fusion

The core idea behind task fusion is that there's no right, generic, global task granularity on modern HPC systems. Instead, we may assume that

  • too small tasks are bad as they induce too much overhead. Nevertheless, they might be the right modelling tool/language;
  • smallish tasks perform well on CPUs;
  • GPUs require rather huge tasks.

In Peano, we try to use tasks that are as small as possible, but equip the runtime to bundle (fuse) these tasks later on at runtime. This differs from other frameworks which require the application to come up with the large, fused tasks a priori - for example by working with larger patches.

Tasks that can be fused have to implement the canFuse() predicate, and they have to provide a function which can handle a set of tasks of the same type. Details on how the fusion is implemented are provided by tarch::multicore::taskfusion.
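The fusion idea - many small tasks of one type handed to a single batch routine - can be sketched in a few lines. The names SmallTask, processFusedBatch and fuseAndRun are illustrative only and not part of the tarch API:

```cpp
#include <vector>

// A tiny task carrying its input and, after execution, its result.
struct SmallTask {
  int input;
  int result = 0;
};

// The fused counterpart of running each task alone: one routine
// handles a whole set of tasks of the same type, amortising overhead.
void processFusedBatch(std::vector<SmallTask>& batch) {
  for (auto& task : batch) {
    task.result = task.input * task.input;  // stand-in for real work
  }
}

int fuseAndRun(const std::vector<int>& inputs) {
  std::vector<SmallTask> batch;
  for (int v : inputs) batch.push_back(SmallTask{v});
  processFusedBatch(batch);  // one invocation instead of many
  int sum = 0;
  for (const auto& task : batch) sum += task.result;
  return sum;
}
```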

Implementation

Most implementation details can be found in the namespace documentation.