The CPU cores of Mahti, running at 2.6 GHz, can perform about 20,000 floating-point operations in 0.5 microseconds, which may sound impressive. However, communicating 20,000 double-precision floating-point numbers over a 200 Gb/s interconnect takes about 6.4 microseconds, and in the combined 6.9 microseconds the core could theoretically perform over 270,000 floating-point operations. A CPU core that has to wait for data from the interconnect therefore achieves only about 7% of its theoretical peak performance, even assuming that instruction-level parallelism and vectorization are fully utilized.
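The arithmetic behind these figures can be sketched as follows. The per-core peak of roughly 40 GFLOP/s (16 FLOP per cycle at 2.6 GHz, e.g. two 4-wide FP64 FMA units) is an assumption used to reproduce the text's numbers, not an exact specification of Mahti:

```python
# Back-of-envelope model of the compute-vs-communication example.
clock_hz = 2.6e9
flop_per_cycle = 16                      # assumed: 2 FMA units x 4-wide FP64
peak_flops = clock_hz * flop_per_cycle   # ~41.6 GFLOP/s per core

n = 20_000                      # floating-point values / operations
t_compute = n / peak_flops      # ~0.48 microseconds, i.e. about 0.5 us
bits = n * 64                   # FP64 values sent over the interconnect
bandwidth = 200e9               # 200 Gb/s
t_comm = bits / bandwidth       # 6.4 microseconds

# Operations the core could have done in the whole compute+communicate window,
# and the fraction of peak it actually achieves.
missed_ops = peak_flops * (t_compute + t_comm)   # over 270,000
fraction = t_compute / (t_compute + t_comm)      # about 7%
print(f"comm: {t_comm * 1e6:.1f} us, fraction of peak: {fraction:.1%}")
```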
In practice, the situation is not quite that bad: not all data needs to travel over the interconnect, since the same data is often reused for several arithmetic operations. Some problems, however, are closer to the trivially parallel case described earlier, and for these the main memory within a node is typically the main bottleneck. Caches, algorithmic choices, and programming techniques can raise the achievable performance; for example, in some cases computation and communication can be carried out simultaneously, so that one hides the cost of the other.
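A minimal sketch of why overlapping computation and communication helps: if data transfer can proceed in the background (for instance with nonblocking communication), a step takes the maximum of the two phase times rather than their sum. The figures reuse the 0.5 and 6.4 microsecond values from the example above:

```python
# Step time with and without compute/communication overlap.
t_compute = 0.5e-6   # compute phase, from the text's example
t_comm = 6.4e-6      # communication phase

t_serial = t_compute + t_comm        # communicate, then compute
t_overlap = max(t_compute, t_comm)   # perfect overlap of the two phases

print(f"serial: {t_serial * 1e6:.1f} us, overlapped: {t_overlap * 1e6:.1f} us")
```

Here the gain is modest because the step is dominated by communication; overlap pays off most when the two phases are of comparable length.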
Different scientific problems have different inherent limits, such as how many operations are performed per memory or interconnect access, or how often data must be copied between CPUs and GPUs. In the best cases, however, such as the LINPACK benchmark, it is possible to reach up to 80% of the theoretical peak performance of a supercomputer.
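The idea of "operations per memory access" can be made concrete with a simple roofline-style model: attainable performance is capped either by the core's peak or by the kernel's arithmetic intensity times the memory bandwidth. The peak and bandwidth values below are illustrative assumptions, not Mahti's exact specifications:

```python
# Roofline-style estimate of attainable performance per core.
peak = 41.6e9     # FLOP/s, illustrative per-core peak
mem_bw = 25e9     # bytes/s, illustrative per-core share of memory bandwidth

def achievable(flop_per_byte):
    """Attainable FLOP/s for a kernel with the given arithmetic intensity."""
    return min(peak, flop_per_byte * mem_bw)

# A streaming update y = a*x + y does 2 FLOP per 24 bytes of FP64 traffic
# (load x, load y, store y), so it is heavily bandwidth-bound.
axpy = achievable(2 / 24)
# A well-blocked dense matrix multiply reuses data many times per byte moved,
# so it can be compute-bound (64 FLOP/byte is an illustrative figure).
dense = achievable(64)
print(f"axpy: {axpy / peak:.1%} of peak, dense: {dense / peak:.1%} of peak")
```

This is why LINPACK, which is built around dense matrix multiplication, can approach the machine's peak while memory-bound kernels cannot.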
Includes material from the "Supercomputing" online course (https://www.futurelearn.com/courses/supercomputing/) by the Edinburgh Parallel Computing Centre (EPCC), licensed under Creative Commons BY-SA.