Peano
MPI programming

Distributed memory programming in Peano can be tricky, as there are many threads per rank and the ranks run fairly asynchronously.

Therefore, it is always a bad idea to use blocking or synchronising routines. Instead, rely on the routines implemented within tarch::parallel, as they usually take the MPI+X setup into account. Here are a few recommendations:

  1. Do not use a blocking send or receive, even if a routine is logically blocking. Instead, use a non-blocking routine and poll for termination via MPI_Test. While you poll, invoke receiveDanglingMessages() on the services so they can free the queue from other messages, such as load balancing requests, which might be in your way. A sketch of this polling pattern follows the list.
  2. Register your own routine as a service (see tarch::services::Service) and equip it with a receiveDanglingMessages() routine which probes the MPI queues. If a message of relevance is in the MPI queue, receive it. If you cannot react directly, buffer it internally. This routine ensures that the MPI buffers cannot overflow. Overflows force MPI to fall back from eager to rendezvous communication, which can lead to deadlocks. A service sketch is given after the list.
  3. Never impose any synchronisation unless absolutely required. Notably, try to avoid synchronisation within the action sets! In the ideal case, you implement a Jacobi-like data exchange: the action sets send data out. There is a service (see above) which receives the data. This data is then used by the action set (or another one) in the subsequent grid pass.
  4. If you have a global operation, use a non-blocking collective and test for the collective's completion in the subsequent grid run-through.
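
The pattern from item 1, sketched for receiving a single double. The helper name, the tag and the source rank are illustrative, and the include paths and the communicator getter follow the usual tarch layout but should be checked against your Peano version; the service repository call is the same one used in the reduction example below.

#include <mpi.h>

#include "tarch/mpi/Rank.h"
#include "tarch/services/ServiceRepository.h"

// Hypothetical helper: receive one double from sourceRank without blocking the
// whole rank. We issue a non-blocking receive and poll MPI_Test; while we wait,
// all registered services may drain the MPI queues.
double receiveScalarWithPolling( int sourceRank, int tag ) {
  double      buffer = 0.0;
  MPI_Request request;

  MPI_Irecv(
    &buffer, 1, MPI_DOUBLE, sourceRank, tag,
    tarch::mpi::Rank::getInstance().getCommunicator(),
    &request
  );

  int flag = 0;
  while (not flag) {
    MPI_Test( &request, &flag, MPI_STATUS_IGNORE );
    // Allow other messages (load balancing requests, ...) to be taken out of
    // the queues so that MPI can make progress.
    tarch::services::ServiceRepository::getInstance().receiveDanglingMessages();
  }
  return buffer;
}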
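
And a minimal service as described in item 2. The class name, the message tag and the internal buffer are made up for illustration; depending on the Peano version, tarch::services::Service may declare further virtual methods a concrete service has to implement, and registration with the service repository is described in tarch::services::Service.

#include <vector>
#include <mpi.h>

#include "tarch/mpi/Rank.h"
#include "tarch/services/Service.h"

class MyDataExchangeService: public tarch::services::Service {
  private:
    static constexpr int  MyMessageTag = 42;   // hypothetical tag for my messages
    std::vector<double>   _receivedValues;     // data I could not process right away
  public:
    void receiveDanglingMessages() override {
      int        flag = 1;
      MPI_Status status;
      // Drain all pending messages with my tag, so the MPI buffers cannot
      // overflow and force MPI from eager into rendezvous mode.
      while (flag) {
        MPI_Iprobe(
          MPI_ANY_SOURCE, MyMessageTag,
          tarch::mpi::Rank::getInstance().getCommunicator(),
          &flag, &status
        );
        if (flag) {
          double value;
          MPI_Recv(
            &value, 1, MPI_DOUBLE, status.MPI_SOURCE, MyMessageTag,
            tarch::mpi::Rank::getInstance().getCommunicator(),
            MPI_STATUS_IGNORE
          );
          _receivedValues.push_back( value );  // buffer internally for the next grid pass
        }
      }
    }
};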

If a message consists of multiple parts, you might have to fuse the parts into one message. Otherwise, the first part might be received while queue polling (on another thread) grabs the second part and misinterprets it as the first part of yet another message.

Exchange global data between ranks (reduction)

If you reduce data between the ranks, I recommend that you use the tarch wrappers. They have two advantages: they automatically provide deadlock tracking, and they poll the MPI queues, i.e. if the code stalls because too many messages for other purposes (global load balancing, plotting, ...) drop in, they remove these messages from the buffers so that MPI can make progress. The usage is detailed in tarch::mpi::Rank::allReduce(), but the code usually resembles the following:

double myBuffer = _myVariable;
tarch::mpi::Rank::getInstance().allReduce(
  &myBuffer,
  &_myVariable,
  1, MPI_DOUBLE,
  MPI_MIN,
  [&]() -> void { tarch::services::ServiceRepository::getInstance().receiveDanglingMessages(); }
);

Synchronisation

I also offer a boolean semaphore across all MPI ranks. This one is quite expensive, as it has to lock both all threads and all ranks and thus sends quite a few messages back and forth.

The usage is similar to the shared memory lock, but you have to give the semaphore a unique name (you might construct a string using the __LINE__ and __FILE__ macros). It is this name that is used to ensure that the semaphore is unique across all ranks. In this way, the semaphore is similar to an OpenMP named critical section. I recommend avoiding MPI barriers whenever possible. Details can be found in tarch::mpi::Rank::barrier(). A usage sketch is given below.
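
A minimal sketch, assuming the MPI semaphore mirrors the RAII idiom of the shared memory lock. The class and header names tarch::mpi::BooleanSemaphore and tarch::mpi::Lock, as well as the function around the critical section, are my assumption; consult the tarch documentation for the authoritative interface.

#include <string>

#include "tarch/mpi/BooleanSemaphore.h"
#include "tarch/mpi/Lock.h"

// One semaphore object per critical section; the string has to be unique
// across the whole code base, so file and line are a natural choice.
static tarch::mpi::BooleanSemaphore  myGlobalSemaphore(
  std::string(__FILE__) + ":" + std::to_string(__LINE__)
);

void updateGlobalState() {
  tarch::mpi::Lock lock( myGlobalSemaphore );
  // ... code that only one thread on one rank may execute at a time ...
}   // the lock frees the semaphore when it goes out of scope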