Maximum practical performance
Modern CPU cores perform arithmetic operations exceptionally quickly, so the main bottleneck is typically the speed at which data can be fed to them. Disregarding the disk, which is mostly used when opening and closing an application, the slowest link in the chain feeding data to the CPUs of a supercomputer is the interconnect.
The CPU cores of Mahti, running at 2.6 GHz, can perform about 20,000 floating-point operations in 0.5 microseconds, which may sound impressive. However, communicating 20,000 double-precision floating-point numbers over a 200 Gb/s link takes about 6.4 microseconds, and in that time a CPU core could theoretically perform over 260,000 floating-point operations. A core kept waiting for data therefore achieves only about 7% of its theoretical peak performance, even assuming that instruction-level parallelism and vectorization are fully utilized.
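The arithmetic above can be sketched as a short back-of-envelope calculation. The figures used here (2.6 GHz clock, 16 floating-point operations per cycle, 200 Gb/s bandwidth, 64-bit values) are illustrative assumptions consistent with the text, not measured values:

```python
CLOCK_HZ = 2.6e9          # assumed Mahti CPU core clock
FLOPS_PER_CYCLE = 16      # assumed peak (full vectorization and ILP)
LINK_BITS_PER_S = 200e9   # interconnect bandwidth, 200 Gb/s
BITS_PER_VALUE = 64       # one double-precision floating-point number

n_values = 20_000

# Time to move n_values doubles over the interconnect (~6.4 microseconds).
comm_time_s = n_values * BITS_PER_VALUE / LINK_BITS_PER_S

# Floating-point operations the core could do in that time at peak.
flops_in_comm_time = comm_time_s * CLOCK_HZ * FLOPS_PER_CYCLE

# Fraction of peak achieved if each communicated value feeds one operation.
fraction_of_peak = n_values / flops_in_comm_time

print(f"communication time: {comm_time_s * 1e6:.1f} us")
print(f"peak FLOPs in that time: {flops_in_comm_time:,.0f}")
print(f"fraction of peak: {fraction_of_peak:.1%}")
```

With these assumptions the communication takes 6.4 microseconds and the core reaches only a few percent of its peak, matching the estimate in the text.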

In practice, the situation is not quite that bad: not all data needs to travel over the interconnect, since in many cases the same data is reused for several arithmetic operations. However, some problems are closer to the trivially parallel case described earlier, and for those the main memory within a node is typically the main bottleneck. Caches, algorithmic choices, and programming techniques can raise the achievable performance further; for example, it is sometimes possible to overlap computation with communication.
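The effect of data reuse can be illustrated with a simple bandwidth-bound model: if each communicated value participates in r floating-point operations, the achievable fraction of peak grows roughly linearly with r. The numbers below are the same illustrative assumptions as before (2.6 GHz, 16 FLOPs/cycle, 200 Gb/s, 64-bit values), and the function name is hypothetical:

```python
def fraction_of_peak(reuse, clock_hz=2.6e9, flops_per_cycle=16,
                     link_bits_per_s=200e9, bits_per_value=64):
    """Fraction of peak FLOP rate sustainable when every value fetched
    over the interconnect is used in `reuse` operations (simple
    bandwidth-bound model; all parameters are illustrative assumptions)."""
    peak_flops = clock_hz * flops_per_cycle
    values_per_s = link_bits_per_s / bits_per_value
    sustainable_flops = reuse * values_per_s
    return min(1.0, sustainable_flops / peak_flops)

for r in (1, 2, 4, 8, 16):
    print(f"reuse {r:2d}x -> {fraction_of_peak(r):.0%} of peak")
```

In this model, reusing each value roughly 13 times is already enough to keep the core busy, which is why reuse-friendly algorithms fare so much better than ones that stream each value through a single operation.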

Different scientific problems have different inherent limits, such as how many operations are performed per memory or interconnect access, or how often data needs to be copied between CPUs and GPUs. In the best cases, such as the LINPACK benchmark, it is possible to reach up to 80% of the theoretical peak performance of a supercomputer.
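A sketch of why LINPACK-style workloads come close to peak: dense matrix multiplication performs about 2n^3 operations on about 3n^2 data items, so the number of operations per value grows with the matrix size n. The sizes below are arbitrary examples:

```python
def flops_per_value(n):
    """Approximate floating-point operations per matrix element
    for a dense matrix product C = A * B of n-by-n matrices:
    ~2*n**3 operations over ~3*n**2 elements (illustrative model)."""
    return (2 * n**3) / (3 * n**2)

for n in (100, 1000, 10000):
    print(f"n = {n:>5}: about {flops_per_value(n):.0f} operations per value")
```

For large matrices each value is reused thousands of times, so the computation is limited by arithmetic rather than by memory or interconnect bandwidth.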


Includes material from the "Supercomputing" online course (https://www.futurelearn.com/courses/supercomputing/) by the Edinburgh Parallel Computing Centre (EPCC), licensed under Creative Commons CC BY-SA.