When the input data size is kept constant while the number of processing units is increased, we typically speak about strong parallel scaling. In this case, the purpose of parallel programming is to decrease the time needed to solve the problem. The parallel speed-up can be defined as
$ S = \frac{T_s}{T_p}, $
where $T_s$ is the execution time in serial, and $T_p$ is the execution time with $p$ processing units.
In an ideal situation, the speed-up should be directly proportional to the number of processing units (see the figure below). If, for example, Alice and Bob invited their mathematically equally skilled friends Joe and Lucy to help with summing up the numbers, the solution would take only five seconds, compared to the 20 seconds Alice would need alone. Unfortunately, this is rarely the case for real-world problems. In fact, increasing the number of processing units beyond a certain tipping point may even increase the execution time.
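In this ideal case, the speed-up defined above is simply equal to the number of workers:
$ S = \frac{T_s}{T_4} = \frac{20}{5} = 4 = p. $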
As discussed earlier, parallel computing is not only used for speeding up problem-solving but also for enabling studies of larger or more detailed problems. In these situations, both the amount of data and the number of processing units are increased simultaneously, and we speak about weak parallel scaling. In an ideal weak scaling case, the execution time remains constant when the amount of data and the number of processing units are increased in the same proportion. For example, if Alice wanted to sum 80 numbers instead of 20 and asked Bob, Joe, and Lucy to help, 20 seconds would ideally suffice, just as in the original problem.
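Writing $T_p(N)$ for the time $p$ workers need for $N$ numbers (a notation used here only for illustration), and keeping the simplification that one person handles 20 numbers in 20 seconds, the ideal weak scaling case reads
$ T_4(80) = \frac{80}{4} \cdot 1\,\mathrm{s} = 20\,\mathrm{s} = T_1(20). $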
As with strong scaling, ideal weak scaling is rarely achievable for real-world problems, even though good weak scaling is typically easier to reach than good strong scaling.
Several factors can limit parallel scaling. Typically, a parallel program needs to perform some additional operations that are not present in a serial program. There may be some redundant computations, data may need to be transferred, and processing units may need to be synchronized. If there is an imbalance in distributing the workload, the execution time is limited by the slowest processing unit, as the others need to wait for it to finish its job. There can be serial parts in the program as well (i.e., parts that cannot be parallelized). If we designate the fraction of the problem that can be parallelized as $p_f$, the maximum possible speed-up (the so-called Amdahl's law) is
$ S_{max} = \frac{1}{1 - p_f} $
For example, if only 90% of the problem can be parallelized, the maximum speed-up is 10, regardless of the number of CPU cores used.
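As a quick sanity check of this bound, here is a minimal Python sketch; the function name `amdahl_limit` is only an illustrative choice, not part of any library:

```python
def amdahl_limit(parallel_fraction):
    """Maximum possible speed-up when only this fraction of the work is parallelizable."""
    return 1.0 / (1.0 - parallel_fraction)

# 50%, 90% and 99% of the problem parallelizable
for p_f in (0.5, 0.9, 0.99):
    print(f"p_f = {p_f:.2f} -> maximum speed-up = {amdahl_limit(p_f):.0f}")
# p_f = 0.50 -> maximum speed-up = 2
# p_f = 0.90 -> maximum speed-up = 10
# p_f = 0.99 -> maximum speed-up = 100
```

Intuitively, if 10% of the work stays serial, that part alone already takes one tenth of the original runtime no matter how many processing units are used, which is where the limit of 10 comes from.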
To summarize, the main causes limiting parallel scaling are parallel overheads (such as communication, synchronization, and redundant computation), load imbalance between the processing units, and serial parts of the program that cannot be parallelized.
Let's illustrate some of these limits with the help of Alice and friends, assuming it takes one second to sum two numbers and 0.1 seconds to communicate a single number. For Alice alone, the solution would take 19 seconds when there are a total of 20 numbers to work with (i.e. $T_s = 19$, since summing 20 numbers requires 19 additions). When working with Bob, each can sum their own 10 numbers in nine seconds, working simultaneously, if we disregard the time needed for setting up and dividing the work. However, Bob needs to communicate his partial result to Alice, and Alice needs to perform the final summation. Thus, the total time required is
$ T_2 = 9 + 0.1 + 1 = 10.1 $
When Joe and Lucy join in, each of the four can simultaneously sum five numbers, but more time is needed to communicate the partial sums to Alice and for Alice to perform the final calculation:
$ T_4 = 4 + 3 \cdot 0.1 + 3 \cdot 1 = 7.3 $
If we increase the number of workers even further to eight people, we notice a new issue: 20 numbers cannot be distributed evenly to eight workers ($20 / 8 = 2.5$), which means there is a load imbalance. The best option is to have four people process three numbers each and the other four process two numbers each ($4 \cdot 3 + 4 \cdot 2 = 20$). However, as all parallel work needs to be finished before Alice can gather the partial results and combine them into the final one (this type of operation is called a reduction), the "parallel time" of the operation is determined by the workers processing three numbers, who need two seconds, and the total time is
$ T_8 = 2 + 7 \cdot 0.1 + 7 \cdot 1 = 9.7 $
Thus, it takes more time with eight workers than with four! If the work were distributed in a less intelligent manner, say, seven workers processing two numbers each and a single worker processing six, the slowest worker alone would need five seconds, and the total time of 12.7 seconds would be even slower than two workers summing 10 numbers each.
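The bookkeeping above is easy to mechanize. The following sketch encodes the cost model of this example (one second per addition, 0.1 seconds per communicated partial sum, and Alice adding the $p - 1$ partial results to her own at the end); the function `total_time` and its argument names are illustrative choices, not part of any library:

```python
import math

def total_time(n_numbers, n_workers, t_add=1.0, t_comm=0.1):
    """Time to sum n_numbers with n_workers under the cost model of the example."""
    # The slowest worker gets ceil(N / p) numbers and needs one addition fewer than that.
    largest_share = math.ceil(n_numbers / n_workers)
    compute = (largest_share - 1) * t_add
    # p - 1 partial sums are communicated to Alice, who adds each one to her own result.
    communicate = (n_workers - 1) * t_comm
    final_sum = (n_workers - 1) * t_add
    return compute + communicate + final_sum

for p in (1, 2, 4, 8):
    print(f"{p} worker(s): {total_time(20, p):.1f} s")
# 1 worker(s): 19.0 s
# 2 worker(s): 10.1 s
# 4 worker(s): 7.3 s
# 8 worker(s): 9.7 s
```

With eight workers, the communication and final summation already outweigh the saved computing time, which is exactly the slowdown seen above.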
In this example, we notice that with more workers, the computing time an individual worker needs decreases, but the communication time as well as the time needed for the final summation increase correspondingly. Even though the reduction can, in practice, be performed in a somewhat more intelligent way than shown here (for example, by combining the partial sums pairwise along a tree), it will still induce a parallel overhead. In general, the smaller we make the subproblems, the more severe the communication and load imbalance overheads become. For example, if the problem were to sum 1001 numbers (still with some load imbalance), the overheads would be much less significant:
$ T_s = 1000, $
$ T_8 = 125 + 7 \cdot 0.1 + 7 \cdot 1 = 132.7, $
and
$ S = \frac{T_s}{T_8} \approx 7.5, $
which would be considered a very good parallel speed-up.
Also, if we consider a weak scaling case where the problem size is increased to 160 numbers and we have eight workers, the time needed would be $19 + 7 \cdot 0.1 + 7 \cdot 1 = 26.7$ seconds, compared to the 19 seconds Alice needs alone for the original 20 numbers.
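Plugging these last two cases into the same cost model (again just a sketch, with the numbers taken from above) reproduces the results:

```python
# Same cost model: one second per addition, 0.1 seconds per communicated partial sum.

# Strong scaling, 1001 numbers, 8 workers: the largest share is 126 numbers (125 additions).
t_serial = 1000
t_8 = 125 + 7 * 0.1 + 7 * 1
print(f"T_8 = {t_8:.1f} s, speed-up = {t_serial / t_8:.1f}")  # T_8 = 132.7 s, speed-up = 7.5

# Weak scaling, 20 numbers per worker (160 numbers, 8 workers): 19 additions per worker.
print(f"T_8 = {19 + 7 * 0.1 + 7 * 1:.1f} s")                  # T_8 = 26.7 s
```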
Even though most real-world problems are much more complex than the one Alice and friends are solving, it is clear that it can be very challenging to minimize parallel overheads and maximize the ratio of computing to communication while maintaining a good load balance when using many parallel workers.