
SHPC: Parallelism Basics

The final of six posts covering content from SHPC4001 and SHPC4002.


Welcome to the final post in a six-part retrospective of UWA’s SHPC units. This will be a very brief discussion of parallelism using OpenMP and OpenMPI (hereafter simply MPI) in Fortran.

OpenMP

OpenMP is an API for shared-memory multithreading. The core model is fork-join: a single parent thread forks into a team of child threads, all of which share memory. In Fortran, we can specify such a parallel region using the standard directive-pair syntax:

!$OMP PARALLEL
  write(*,*)"Hello world!"
!$OMP END PARALLEL

while in C we need to use the pragma directive:

#pragma omp parallel
printf("Hello world!\n");

These snippets will simply print \( N \) “Hello world!” messages, where \( N \) is the default number of threads.

To use the OpenMP runtime routines, the library must be explicitly loaded, either as a header in C (#include <omp.h>) or as a module in Fortran (use OMP_LIB). To compile (with GCC/GFortran), we must pass a flag to the compiler:

gcc -fopenmp -o [program] [code.c]
gfortran -fopenmp -o [program] [code.f95]

Setting and getting threads

We can manually specify the number of threads with the OMP_set_num_threads() subroutine. We can also get the ID of the current thread using OMP_get_thread_num(). Consider the following:

program test

use OMP_LIB
call OMP_SET_NUM_THREADS(4)

!$OMP PARALLEL
   if (OMP_GET_THREAD_NUM() == 0) then
     write(*,*)"Hello world!"
   end if
!$OMP END PARALLEL

end program test

This program creates 4 threads, but only the master thread (with a thread ID of 0) prints “Hello world!” The MASTER directive can be used to do the same thing, but more on that later. For now, let’s talk more about what happens to variables.

Private variables

Let’s look at this code snippet from my dark matter halo assignment:

!$OMP PARALLEL DO PRIVATE(new_ll)
do i = 1,npart,1
   call ll_init(new_ll, (/0.0,0.0,0.0/), 0)
   neighbours(i)%p => new_ll
end do
!$OMP END PARALLEL DO

This code populates an array with linked lists. First things first, PARALLEL DO is used to parallelise a do loop: the iterations of the loop are divided among the threads, and each thread executes its share of them. Why is new_ll private? This stops multiple threads from working on the same linked list and adding duplicates into the array. What PRIVATE does is give the variable local scope within each thread (i.e. each thread gets its own copy of the variable). In contrast, SHARED specifies that the variable has global scope across all threads. In OpenMP, variables are shared by default (the loop index is a notable exception: it is made private automatically).
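
As a minimal sketch of the two clauses (the arrays a and b and the temporary tmp here are just placeholders, not from the halo code), an explicit PRIVATE/SHARED declaration looks like this:

!$OMP PARALLEL DO PRIVATE(tmp) SHARED(a, b)
do i = 1, n
   tmp = 2.0*b(i)      !tmp is private: each thread works with its own copy
   a(i) = tmp + 1.0    !a and b are shared: every thread reads/writes the same arrays
end do
!$OMP END PARALLEL DO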

Thread Execution

Suppose we have several blocks of code, each of which we want executed by only one thread. Inside a parallel region, we can use SECTIONS, with each SECTION handed to a single thread:

!$OMP SECTIONS
!$OMP SECTION
  write(*,*)"Hello"
!$OMP SECTION
  write(*,*)"world!"
!$OMP END SECTIONS

We can also use SINGLE to specify code that should only be executed by one thread, or MASTER to use the master thread.
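
As a brief sketch of the difference between the two:

!$OMP PARALLEL
   !$OMP MASTER
      write(*,*)"Printed only by the master thread (ID 0)"
   !$OMP END MASTER
   !$OMP SINGLE
      write(*,*)"Printed by exactly one thread, whichever arrives first"
   !$OMP END SINGLE
!$OMP END PARALLEL

One practical difference worth remembering: END SINGLE carries an implied barrier, whereas MASTER does not.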

Now, what if we have a task that we want to parallelise, but there is some sensitive operation (such as writing to a file, or updating a shared data structure) that only one thread should perform at a time? We do this using CRITICAL. An example from my dark matter halo project is populating the cubic mesh with each particle.

!$OMP PARALLEL DO PRIVATE(j_x, j_y, j_z, i)
   do i = 1,npart,1
   !$OMP CRITICAL
      j_x = floor(abs(x(i) - (bds_mv(1)+0.0001))/mesh_subsize) + 1
      j_y = floor(abs(y(i) - (bds_mv(2)+0.0001))/mesh_subsize) + 1
      j_z = floor(abs(z(i) - (bds_mv(3)+0.0001))/mesh_subsize) + 1
      call ll_insert(mesh(j_x,j_y,j_z)%p, (/x(i),y(i),z(i)/), i)
   !$OMP END CRITICAL
   end do
!$OMP END PARALLEL DO

Another way to ensure that a single memory location is only ever updated by one thread at a time is ATOMIC, which applies to the single update statement that follows it.
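
As a minimal sketch (counter is just an illustrative shared variable, not from the halo code):

counter = 0
!$OMP PARALLEL DO
do i = 1, npart
   !$OMP ATOMIC
   counter = counter + 1   !the update of counter is performed atomically
end do
!$OMP END PARALLEL DO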

Synchronisation

Suppose your parallel region is composed of two chunks of code, such as (1) performing a bunch of calculations and (2) analysing the results. In such a scenario, you wouldn’t want to start analysing before the calculations have finished. We can use the BARRIER directive to mark a point in our code where each thread must wait until all the other threads have reached it. For example:

!$OMP PARALLEL
   !some important subroutine
   !$OMP BARRIER
   !code that depends on previous subroutine having been completed
!$OMP END PARALLEL

Now you must be very careful with BARRIER statements or else you could end up with a deadlock. Consider the following:

!$OMP PARALLEL
   !$OMP CRITICAL
      !some important subroutine
      !$OMP BARRIER
      !code that depends on previous subroutine having been completed
   !$OMP END CRITICAL
!$OMP END PARALLEL

Because the barrier sits inside a critical region, only one thread at a time can reach it. That thread then blocks at the barrier while still holding the critical region, so no other thread can ever enter and reach the barrier itself: everyone waits forever.

Sometimes, especially before processing some massive data structure (such as a cubic mesh of linked lists), you might want to ensure that every thread sees a consistent, up-to-date view of shared variables. We can do this using the FLUSH directive.

MPI

The Message Passing Interface is a communication standard for parallel computing. MPI is used to execute a parallel program over several compute nodes (i.e. different CPUs), each with its own memory. Where OpenMP is concerned with managing work across threads that share memory, MPI generally involves partitioning data and/or splitting tasks across multiple processes, each running a copy of the same program (the SPMD model). Naturally, the dark matter halo project lends itself well to MPI; the cubic mesh can be split across multiple processes, with each one analysing a certain region of the mesh. This section is designed to give you a brief overview of how MPI works as well as its commonly used routines. For a considerably more in-depth overview, consider this MPI Tutorial.

Compiling and running

There are several implementations of MPI. I used Open MPI, but I know of others who used Intel MPI (as part of Intel Parallel Studio). Open MPI includes the mpifort compiler wrapper, which is needed in order to properly compile your code.

$mpifort -o [program] [code.f95]

Remember to include -fopenmp if your code also uses OpenMP. Executing is a little different:

$mpirun -n [processes] [program]

Here you must specify the number of processes. What mpirun then does is create and run n copies of your program.

Setting up your world

This is the basic structure of a program that uses MPI:

program test

use MPI
implicit none
integer :: ierr, mpi_rank, mpi_size

call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, mpi_rank, ierr) !get process rank
call MPI_COMM_SIZE(MPI_COMM_WORLD, mpi_size, ierr) !get num of processes

if (mpi_rank == 0) then
  write(*,*) 'I am the master process'
end if

call MPI_FINALIZE(ierr)

end program test

We initialise MPI with MPI_INIT and terminate it with MPI_FINALIZE.

Interprocess communication

Let’s jump into the core feature of MPI. Suppose you want to send an array from process 0 (the master process) to process 1 (a slave process). We use MPI_SEND and MPI_RECV. Consider the following:

integer :: a(10), status(MPI_STATUS_SIZE), src, dest, tag

src = 0   !send from the master process...
dest = 1  !...to process 1
tag = 0   !label used to match the send with the receive

if (mpi_rank == src) then
  call MPI_SEND(a,size(a),MPI_INTEGER,dest,tag,MPI_COMM_WORLD,ierr)
end if

if (mpi_rank == dest) then
  call MPI_RECV(a,size(a),MPI_INTEGER,src,tag,MPI_COMM_WORLD,status,ierr)
end if

There’s a lot going on here, so let’s walk through the parameters. First, we need to specify the start of the buffer (this is just the array a). Second comes the number of elements to be sent/received; this is the size of the array. Third comes the data type; this must be an MPI-compatible type (such as MPI_INTEGER). Fourth comes the destination/source, an integer giving the rank (process ID) that the data is sent to/received from. Fifth is the tag, an integer label used to match a send with the corresponding receive. Sixth is the communicator, MPI_COMM_WORLD. MPI_RECV additionally takes a status parameter, which is filled with information about the received message (such as its actual source and tag). And the last parameter, ierr, is required in Fortran; in C/C++ there is no ierr, as the error code is instead the return value of the function.

It is worth noting that both MPI_SEND and MPI_RECV are so-called blocking calls: the subroutine does not return until the operation is complete (for MPI_RECV, until the data has arrived; for MPI_SEND, until the send buffer is safe to reuse). This seems innocuous, but it can pretty easily lead to deadlocks. Consider the following example:

if (mpi_rank == 0) then
   call MPI_RECV(recv,size(recv),MPI_INTEGER,1,tag,MPI_COMM_WORLD,status,ierr)
   call MPI_SEND(send,size(send),MPI_INTEGER,1,tag,MPI_COMM_WORLD,ierr)
end if

if (mpi_rank == 1) then
   call MPI_RECV(recv,size(recv),MPI_INTEGER,0,tag,MPI_COMM_WORLD,status,ierr)
   call MPI_SEND(send,size(send),MPI_INTEGER,0,tag,MPI_COMM_WORLD,ierr)
end if

Here both processes are stuck waiting for a light that never comes: each is blocked in MPI_RECV, so neither ever reaches its MPI_SEND. One solution is to swap the order of the MPI_RECV and MPI_SEND calls in process 0, but another is to use the non-blocking commands MPI_ISEND and MPI_IRECV. These subroutines return immediately, handing back a request handle, so by themselves there is no way of knowing whether the transfer has actually completed. For this reason, we pass the request to MPI_WAIT to explicitly wait until the operation has finished.
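
As a sketch of how the deadlock above could be untangled with a non-blocking receive (reusing the same buffers, tag and status as before, plus a new request handle):

integer :: request

if (mpi_rank == 0) then
   call MPI_IRECV(recv,size(recv),MPI_INTEGER,1,tag,MPI_COMM_WORLD,request,ierr)
   call MPI_SEND(send,size(send),MPI_INTEGER,1,tag,MPI_COMM_WORLD,ierr)
   call MPI_WAIT(request,status,ierr)   !block until the receive has actually completed
end if

if (mpi_rank == 1) then
   call MPI_IRECV(recv,size(recv),MPI_INTEGER,0,tag,MPI_COMM_WORLD,request,ierr)
   call MPI_SEND(send,size(send),MPI_INTEGER,0,tag,MPI_COMM_WORLD,ierr)
   call MPI_WAIT(request,status,ierr)
end if

Because MPI_IRECV returns immediately, both processes reach their MPI_SEND, the messages cross, and MPI_WAIT then confirms that each receive has landed.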

Collective communication

Suppose you want to send an array to all the other processes. We can do this with MPI_BCAST.

call MPI_BCAST(a,size(a),MPI_INTEGER,src,MPI_COMM_WORLD,ierr)

Here the fourth parameter is simply the source (or root), i.e. the process from which the broadcast originates (there is no need to specify a destination, as the data is sent to all the other processes in MPI_COMM_WORLD).

Now suppose you don’t want to send the entire array to each process, but instead want to send part of the array to each process. This is where MPI_SCATTER comes in. As the name implies, it scatters the data between all the processes. So if there are 4 processes, each process receives 1/4th of the array. The order of the decomposition is by the rank of the process, so process 0 receives the first quarter, 1 the second quarter, and so on.
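
As a small sketch (assuming 4 processes and an array whose length divides evenly between them):

integer, parameter :: n = 16
integer :: full(n), chunk(n/4)

!process 0 scatters full; every process (including 0) receives n/4 elements in chunk
call MPI_SCATTER(full, n/4, MPI_INTEGER, chunk, n/4, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)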

The inverse of scatter is MPI_GATHER. Here you have data spread across all processes that you want to stitch together into one array. MPI_GATHER collects the data into whichever process you specify as the root. Suppose you then want to broadcast this new, single array back to all the processes. Instead of calling MPI_GATHER followed by MPI_BCAST, you can use MPI_ALLGATHER. What MPI_ALLGATHER really does is gather all of the disparate data into every process (instead of just the root process).
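
Continuing the scatter sketch above, gathering the chunks back together might look like this:

!collect every process's chunk into full on the root process (process 0)
call MPI_GATHER(chunk, n/4, MPI_INTEGER, full, n/4, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

!or collect the chunks into full on every process at once (no root needed)
call MPI_ALLGATHER(chunk, n/4, MPI_INTEGER, full, n/4, MPI_INTEGER, MPI_COMM_WORLD, ierr)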

There’s a lot more to delve into in the world of parallelism, far beyond the scope of this introductory post. Hopefully you’ll find this post and the code snippets useful for kickstarting your own multithreaded projects!