Parallel programming



Processes and Threads

Parallel processing is normally based either on using multiple processes or multiple threads.
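
As a rough illustration of the difference, the following minimal sketch (not part of the course material; assumes a Linux system and compilation with, for example, g++ -pthread) creates both a child process with fork() and a thread with std::thread. The process works on its own copy of the data, whereas the thread shares the memory of the process that created it.

#include <cstdio>
#include <thread>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
   int value = 0;

   // A process created with fork() gets a private copy of 'value'
   pid_t pid = fork();
   if (pid == 0) {      // this branch runs in the child process
      value = 1;        // modifies only the child's copy
      return 0;
   }
   wait(nullptr);
   std::printf("after child process: value = %d\n", value);  // still 0

   // A thread shares the memory of the process that created it
   std::thread t([&value]() { value = 2; });
   t.join();
   std::printf("after thread:        value = %d\n", value);  // now 2
}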

In the video below, we discuss the most important differences between them.



Message Passing

The message passing paradigm is the most common way to parallelize scientific computing programs. In message passing, each task can access only its own local data, and explicit action is required to exchange data between the parallel tasks; in practice, this means calling specific subroutines or functions to perform the send or receive. For this purpose, there is a library called MPI (Message Passing Interface), whose communication routines can be called from Fortran and C/C++. Other languages (such as Python and Julia) also have extension packages that allow the use of MPI.

When using MPI, the whole program has to be parallelized; it is not possible to parallelize it only part by part. MPI can be used both within the shared memory nodes and between them. Even though parallelizing a code with MPI may require a lot of work, due to the explicit nature of the communication and the many parameters of the subroutine/function calls, the performance is typically good (if done right). Also, with MPI, the programmer is completely in charge of the parallelization, and nothing is left to the compiler.

The execution and data model are discussed in more detail in the video below.



The following is a simple MPI code example written in Fortran. When run, each task prints out its rank and the total number of tasks with which the program is launched.

program hello
  use mpi_f08

  implicit none

  integer :: ierror, rank, ntasks

  call MPI_INIT(ierror)

  call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

  write(*,*) 'Hello from task', rank, '/', ntasks

  call MPI_FINALIZE(ierror)

end program hello

The corresponding code in C:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
  int rank, ntasks;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

  printf("Hello from task %d / %d \n", rank, ntasks);

  MPI_Finalize();
}
In most Linux distributions, MPI is available from the package manager. MPI programs are typically compiled with a special wrapper command, such as mpif90 (Fortran), mpicc (C), or mpic++ (C++):

$ mpicc -o hello hello.c
In order to launch N MPI tasks, a special launcher command is required, typically mpiexec on a basic Linux system. For example, running the above code with four tasks would be done as follows:

$ mpiexec -n 4 ./hello
    Hello from task 0 / 4
    Hello from task 2 / 4
    Hello from task 1 / 4
    Hello from task 3 / 4
Typically, you want the number of MPI tasks to be the same as the number of cores to be used. In order to perform actual work in parallel, the data first needs to be divided between the parallel tasks, and afterward the results of the individual tasks are combined. For example, with MPI the sum of the elements of an array could be carried out (in Fortran) as follows:
program parallel_sum
 use mpi_f08

 implicit none

 integer :: i, local_size
 real :: array(100) = [(i, i=0, 99)]
 real, allocatable :: local_array(:)
 real :: total_sum, local_sum
 integer :: ierror, rank, ntasks

 call MPI_INIT(ierror)

 call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierror)
 call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

 ! Allocate space for the sub-arrays and distribute the data from rank 0
 ! (assumes that the array size is divisible by the number of tasks)
 local_size = 100 / ntasks
 allocate(local_array(local_size))
 call MPI_SCATTER(array, local_size, MPI_REAL, local_array, local_size, MPI_REAL, & 
                  0, MPI_COMM_WORLD, ierror)

 ! Each task calculates its own local sum
 local_sum = sum(local_array)

 ! Gather the local sums to rank 0 and sum them on the fly (reduce)
 call MPI_REDUCE(local_sum, total_sum, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierror)

 if (rank == 0) then
     write(*,*) 'sum of array', total_sum
 end if

 call MPI_FINALIZE(ierror)

end program parallel_sum
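
The example above uses collective communication (scatter and reduce). The explicit send and receive operations mentioned at the beginning of this section are performed with point-to-point routines such as MPI_Send and MPI_Recv. A minimal sketch (not part of the course examples; assumes the program is launched with at least two tasks):

#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv)
{
   int rank;
   double data = 0.0;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   if (rank == 0) {
      data = 42.0;
      // send one double to task 1 (message tag 0)
      MPI_Send(&data, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
   } else if (rank == 1) {
      // receive one double from task 0
      MPI_Recv(&data, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      std::printf("Task 1 received %f\n", data);
   }

   MPI_Finalize();
}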


Threading

Shared memory parallelization, in which parallel threads can access the shared memory independently, can be used within a node. On the one hand, this makes programming easier; on the other hand, it can also lead to poor performance and to seemingly random errors that are difficult to find.
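
As a minimal sketch of the kind of error this can cause (not from the course material), the following program lets two threads increment a shared counter without any synchronization. The updates can overwrite each other, so the printed value is typically smaller than expected and varies from run to run.

#include <iostream>
#include <thread>

int main()
{
   int counter = 0;

   auto work = [&counter]() {
      for (int i = 0; i < 1000000; ++i)
         ++counter;   // unsynchronized read-modify-write on shared data
   };

   std::thread t1(work);
   std::thread t2(work);
   t1.join();
   t2.join();

   // The expected result is 2000000, but because of the data race the
   // actual value is usually smaller and differs between runs
   std::cout << "counter = " << counter << std::endl;
}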

The most popular shared-memory parallelization method in scientific computing supplements the code with OpenMP pseudo comments that tell a compatible compiler that the adjacent code can be parallelized. In Fortran, these pseudo comments are called directives, and in C/C++ pragmas; they affect the compilation only if the compiler is instructed to look for them. In theory, at least, it is therefore possible to have a single source code that can be compiled for both serial and parallel execution.

When using OpenMP, the programmer mostly relies on the compiler for the parallelization, and the code can also be parallelized incrementally, which is quite convenient. This contrasts with MPI, where parallelization is an all-or-nothing affair and where the parallelized code cannot be run without an MPI library, not even serially.

As an example, the parallel sum could be implemented with OpenMP (in C++) as:

#include <iostream>
#include <vector>

int main()
{
   auto array = std::vector<float>(100);

   for (int i=0; i < 100; ++i)
      array[i] = i;

   auto sum = array[0];
#pragma omp parallel for reduction(+:sum)
   for (int i=1; i < 100; ++i)
      sum += array[i];

   std::cout << "Sum of array " << sum << std::endl;
}

In order to compile with OpenMP, an additional flag is needed (with the GNU C++ compiler, for example, -fopenmp). The number of threads to use can be set at run time with the OMP_NUM_THREADS environment variable; by default, typically as many threads as there are cores in the node are used.

$ g++ -o sum sum.cpp -fopenmp
$ OMP_NUM_THREADS=4 ./sum
Sum of array 4950

While it is technically possible to use more threads than there are CPU cores, this is usually not beneficial, as performance normally suffers.

It is also possible to combine MPI and OpenMP parallelization: OpenMP is then used within a shared memory node, while MPI handles the communication between the nodes.
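
A minimal sketch of such a hybrid program (not part of the course examples; compiled with the MPI wrapper and the OpenMP flag, e.g. mpic++ -fopenmp) initializes MPI with thread support and then opens an OpenMP parallel region in every MPI task:

#include <cstdio>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv)
{
   int rank, provided;

   // Request a thread support level where only the main thread calls MPI
   MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   // Each MPI task spawns its own team of OpenMP threads
   #pragma omp parallel
   {
      std::printf("MPI task %d, OpenMP thread %d\n", rank, omp_get_thread_num());
   }

   MPI_Finalize();
}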


GPU Programming

For GPU parallelization, the most common alternatives are currently the following:

  • CUDA is an extension to C/C++ from the GPU vendor Nvidia. With CUDA, the parts of the code that run on the CPU are written in C/C++, while the computing-intensive parts offloaded to GPUs (called kernels) are written in CUDA. CUDA is a relatively low-level approach that requires a lot of work and careful programming to utilize the highly parallel processor. When done right, the performance is very good.
  • HIP can be considered the AMD version of CUDA, and the conversion between them is mostly one-to-one.
  • OpenACC is a directive-based approach to GPU programming available for Fortran and C/C++. The programmer inserts special pseudo comments, based on which a compatible compiler generates the necessary machine code for GPU execution. Programming with OpenACC is easier than with CUDA, and the performance can be relatively good as well (a small sketch is shown after this list).
  • During the last few years, OpenMP has been extended to support GPU offloading, and, as such, it is nowadays very much comparable to OpenACC. Currently, the compiler support is not quite on a satisfactory level, but in the long run, OpenMP is expected, or at least hoped, to become mainstream in GPU programming.
  • There are various portability frameworks, such as Kokkos, Raja, and SYCL.
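
As a rough sketch of the directive-based approach (an illustration rather than a tested example; assumes an OpenACC-capable compiler, for instance nvc++ with the -acc flag), the earlier array sum could be offloaded with OpenACC along these lines:

#include <iostream>
#include <vector>

int main()
{
   auto array = std::vector<float>(100);

   for (int i=0; i < 100; ++i)
      array[i] = i;

   float sum = 0.0f;
   float* data = array.data();

// Ask the compiler to offload the loop to the GPU, copying the data in
// and combining the partial sums of the threads
#pragma acc parallel loop reduction(+:sum) copyin(data[0:100])
   for (int i=0; i < 100; ++i)
      sum += data[i];

   std::cout << "Sum of array " << sum << std::endl;
}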

All the above approaches can be combined with MPI when using multiple GPU nodes.

As discussed earlier, not all problems are easily or efficiently adaptable to GPUs. That is the case with the array element summing that we have been using as an example. In the example code snippet below, the element-wise sum of two arrays is therefore computed using CUDA instead.


#include <iostream>
#include <vector>
#include <cuda_runtime_api.h>

// The computational kernel that will be executed in the GPU
__global__ void sum_arrays(const float* input1, const float* input2, float* output, int n)
{
   // global thread index
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   // make sure that the extra threads of the last block do not go past the end of the arrays
   if (idx < n)
      output[idx] = input1[idx] + input2[idx];
}

int main()
{
   // As the GPU performs well only with massive parallelism,
   // we use a larger array, i.e. a million elements
   const int array_size = 1000000;
   auto array1 = std::vector<float>(array_size);
   auto array2 = std::vector<float>(array_size);
   auto array3 = std::vector<float>(array_size);

   for (int i=0; i < array_size; ++i) {
      array1[i] = i;
      array2[i] = -i;
   }

   // Allocate memory in the GPU
   float* array1_dev;   
   float* array2_dev;   
   float* array3_dev;   
   cudaMalloc((void**)&array1_dev, array_size * sizeof(float));
   cudaMalloc((void**)&array2_dev, array_size * sizeof(float));
   cudaMalloc((void**)&array3_dev, array_size * sizeof(float));

   // copy the input data from the CPU to the GPU
   cudaMemcpy(array1_dev, array1.data(), array_size * sizeof(float), cudaMemcpyHostToDevice);
   cudaMemcpy(array2_dev, array2.data(), array_size * sizeof(float), cudaMemcpyHostToDevice);

   // Launch the compute kernel in the GPU with at least "array_size" threads
   const int threads_per_block = 256;
   const int blocks = (array_size + threads_per_block - 1) / threads_per_block;
   sum_arrays<<<blocks, threads_per_block>>>(array1_dev, array2_dev, array3_dev, array_size);

   // copy the results back from the GPU to the CPU
   cudaMemcpy(array3.data(), array3_dev, array_size * sizeof(float), cudaMemcpyDeviceToHost);

   // Sum the result array, correct result is zero
   auto sum = array3[0];
   for (int i=1; i < array_size; ++i)
      sum += array3[i];

   std::cout << "Sum of array " << sum << std::endl;
}




You might not be familiar with C++, but it should be quite clear that even though the computational kernel, sum_arrays, is very simple by itself, some extra code is needed to use GPUs. In contrast to shared memory parallelization with CPU threads, with GPUs typically many more threads are used than there are cores.