Parallel processing is normally based on using either multiple processes or multiple threads.
In the video below, we discuss the most important differences between them.
The message passing paradigm is the most common way to parallelize scientific computing programs. In message passing, the tasks can only access their local data, and explicit action is required to exchange data between the parallel tasks; in practice, this means calling specific subroutines or functions to perform the send or receive. For this purpose, there is a library called MPI (Message Passing Interface), whose communication routines can be called from Fortran and C/C++. Other languages (such as Python and Julia) also have extension packages that allow the use of MPI.
When using MPI, the whole program has to be parallelized at once; parallelizing it part by part is not possible. MPI can be used both within the shared memory nodes and between them. Even though parallelizing a code with MPI may require a lot of work, due to the explicit nature of the communication and the many parameters of the subroutine/function calls, the performance is typically good (if done right). Also, with MPI the programmer is completely in charge of the parallelization, and nothing is left to the compiler.
The execution and data model are discussed in more detail in the video below.
The following is a simple MPI code example written in Fortran. When run, each task prints out its rank and the total number of tasks with which the program is launched.
program hello
  use mpi_f08
  implicit none
  integer :: ierror, rank, ntasks

  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

  write(*,*) 'Hello from task', rank, '/', ntasks

  call MPI_FINALIZE(ierror)
end program hello
The corresponding code in C:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, ntasks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    printf("Hello from task %d / %d\n", rank, ntasks);

    MPI_Finalize();
}
The program is compiled with an MPI compiler wrapper, for example:
$ mpicc -o hello hello.c
Launching MPI tasks requires a special launcher command, such as mpiexec on basic Linux systems. For example, running the above code with four tasks would look as follows:
$ mpiexec -n 4 ./hello
Hello from task 0 / 4
Hello from task 2 / 4
Hello from task 1 / 4
Hello from task 3 / 4
Typically, you want the number of MPI tasks to match the number of cores to be used.
In order to perform actual work in parallel, the data first needs to be divided between the parallel tasks, and afterward the results of the individual tasks are combined. For example, with MPI, the sum of the elements of an array could be computed (in Fortran) as follows:
program parallel_sum
  use mpi_f08
  implicit none
  integer :: i, local_size
  real :: array(100) = [(real(i), i = 0, 99)]
  real, allocatable :: local_array(:)
  real :: total_sum, local_sum
  integer :: ierror, rank, ntasks

  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

  ! Allocate space for sub-arrays and distribute data from rank 0
  ! (assumes that ntasks divides the array size evenly)
  local_size = 100 / ntasks
  allocate(local_array(local_size))
  call MPI_SCATTER(array, local_size, MPI_REAL, local_array, local_size, MPI_REAL, &
                   0, MPI_COMM_WORLD, ierror)

  ! Each task calculates its own local sum
  local_sum = sum(local_array)

  ! Gather the partial sums to rank 0 and add them on the fly (reduce)
  call MPI_REDUCE(local_sum, total_sum, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierror)

  if (rank == 0) then
    write(*,*) 'sum of array', total_sum
  end if

  call MPI_FINALIZE(ierror)
end program parallel_sum
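For comparison, the corresponding program could be written in C roughly as follows. This is only a sketch and, like the Fortran version above, it assumes that the number of tasks divides the array size evenly:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define ARRAY_SIZE 100

int main(int argc, char** argv)
{
    float array[ARRAY_SIZE];
    float local_sum = 0.0f, total_sum = 0.0f;
    float* local_array;
    int rank, ntasks, local_size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Initialize the full array (only rank 0 actually uses it) */
    for (i = 0; i < ARRAY_SIZE; i++)
        array[i] = i;

    /* Allocate space for sub-arrays and distribute data from rank 0 */
    local_size = ARRAY_SIZE / ntasks;
    local_array = (float*) malloc(local_size * sizeof(float));
    MPI_Scatter(array, local_size, MPI_FLOAT, local_array, local_size, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    /* Each task calculates its own local sum */
    for (i = 0; i < local_size; i++)
        local_sum += local_array[i];

    /* Gather the partial sums to rank 0 and add them on the fly (reduce) */
    MPI_Reduce(&local_sum, &total_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of array %f\n", total_sum);

    free(local_array);
    MPI_Finalize();
}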
Shared memory parallelization, in which parallel threads can access the shared memory independently, can be used within a node. On the one hand, this makes programming easier; on the other hand, it can also lead to poor performance and seemingly random errors that are difficult to find.
The most popular shared-memory parallelization method in scientific computing is OpenMP, in which the code is supplemented with pseudo comments that tell a compatible compiler that the adjacent code can be parallelized. In Fortran these pseudo comments are called directives, and in C/C++ pragmas, and they affect the compilation only if the compiler is instructed to look for them. At least in theory, it is therefore possible to have a single source code that can be compiled for both serial and parallel computing.
When using OpenMP, the programmer mostly relies on the compiler for parallelization, and the code can also be parallelized incrementally, which is quite convenient. This contrasts with MPI, where parallelization is an all-or-nothing approach and, once parallelized, the code can no longer be run serially, at least not without the MPI library.
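To illustrate the thread model, a minimal OpenMP counterpart of the earlier "hello" example could look roughly like this (a sketch in C; each thread prints its id and the total number of threads):
#include <stdio.h>
#include <omp.h>

int main()
{
    /* The block below is executed by every thread in the team */
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Hello from thread %d / %d\n", thread_id, nthreads);
    }
}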
As an example, the parallel sum could be implemented with OpenMP (in C++) as:
#include <iostream>
#include <vector>

int main()
{
    auto array = std::vector<float>(100);
    for (int i = 0; i < 100; ++i)
        array[i] = i;

    auto sum = array[0];
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i < 100; ++i)
        sum += array[i];

    std::cout << "Sum of array " << sum << std::endl;
}
In order to compile with OpenMP, an additional flag is needed (with the GNU C++ compiler, for example, -fopenmp is used). The number of threads to use when running can be specified with the OMP_NUM_THREADS environment variable; by default, as many threads are used as there are cores in the node.
$ g++ -o sum sum.cpp -fopenmp
$ OMP_NUM_THREADS=4 ./sum
Sum of array 4950
While it is technically possible to use more threads than there are CPU cores, this is usually not beneficial, as performance normally suffers.
It is also possible to combine MPI and OpenMP parallelization: within a shared memory node OpenMP is used, while MPI takes care of the communication between the nodes.
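As a rough sketch (in C) of what such a hybrid program can look like, each MPI task below spawns its own team of OpenMP threads; MPI_Init_thread is used instead of MPI_Init to request thread support from the MPI library:
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv)
{
    int rank, ntasks, provided;

    /* Request threading support from the MPI library */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* Within each MPI task, the work is further divided between OpenMP threads */
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Task %d / %d, thread %d / %d\n", rank, ntasks, thread_id, nthreads);
    }

    MPI_Finalize();
}
Compiled with, for example, mpicc -fopenmp and launched with several MPI tasks and OMP_NUM_THREADS set, each task prints one line per thread.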
For GPU parallelization, the most common alternatives are currently the native programming models CUDA (for NVIDIA GPUs) and HIP (for AMD GPUs), the directive-based approaches OpenACC and OpenMP offloading, and portable C++ frameworks such as SYCL and Kokkos.
All the above approaches can be combined with MPI when using multiple GPU nodes.
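In the simplest setup, each MPI task is assigned its own GPU. A minimal sketch (in C, using the CUDA runtime API) of how such an assignment could be done is shown below; it assumes that consecutive MPI ranks are placed on the same node:
#include <mpi.h>
#include <cuda_runtime_api.h>

int main(int argc, char** argv)
{
    int rank, num_devices;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Assign the GPUs of a node to the MPI tasks in a round-robin fashion,
       assuming consecutive ranks reside on the same node */
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(rank % num_devices);

    /* ... kernel launches and MPI communication would follow here ... */

    MPI_Finalize();
}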
As discussed earlier, not all problems are easily or efficiently adaptable to GPUs. That is the case with the array element summing (a reduction) that we have been using as an example. Therefore, in the example code snippet below, the element-wise sum of two arrays is computed using CUDA instead:
#include <iostream>
#include <vector>
#include <cuda_runtime_api.h>

// The computational kernel that will be executed in the GPU
__global__ void sum_arrays(float* input1, float* input2, float* output, int n)
{
    // Global thread index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        output[idx] = input1[idx] + input2[idx];
}

int main()
{
    // As the GPU works well only with massive parallelism,
    // we use a larger array, i.e. a million elements
    size_t array_size = 1000000;
    auto array1 = std::vector<float>(array_size);
    auto array2 = std::vector<float>(array_size);
    auto array3 = std::vector<float>(array_size);
    for (int i = 0; i < array_size; ++i) {
        array1[i] = i;
        array2[i] = -i;
    }

    // Allocate memory in the GPU
    float* array1_dev;
    float* array2_dev;
    float* array3_dev;
    cudaMalloc((void**)&array1_dev, array_size * sizeof(float));
    cudaMalloc((void**)&array2_dev, array_size * sizeof(float));
    cudaMalloc((void**)&array3_dev, array_size * sizeof(float));

    // Copy input data from CPU to GPU
    cudaMemcpy(array1_dev, array1.data(), array_size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(array2_dev, array2.data(), array_size * sizeof(float), cudaMemcpyHostToDevice);

    // Launch the compute kernel in the GPU with at least "array_size" threads,
    // organized into blocks of 256 threads
    int threads_per_block = 256;
    int num_blocks = (array_size + threads_per_block - 1) / threads_per_block;
    sum_arrays<<<num_blocks, threads_per_block>>>(array1_dev, array2_dev, array3_dev, array_size);

    // Copy results back to CPU from GPU
    cudaMemcpy(array3.data(), array3_dev, array_size * sizeof(float), cudaMemcpyDeviceToHost);

    // Free the GPU memory
    cudaFree(array1_dev);
    cudaFree(array2_dev);
    cudaFree(array3_dev);

    // Sum the result array, correct result is zero
    auto sum = array3[0];
    for (int i = 1; i < array_size; ++i)
        sum += array3[i];
    std::cout << "Sum of array " << sum << std::endl;
}
You might not be familiar with C++, but it should be quite clear that even though the computational routine, sum_arrays, is very simple by itself, some extra code is needed to use GPUs. In contrast to shared memory parallelization with threads on CPUs, on GPUs typically many more threads are used than there are cores.