So far, we have been discussing parallelization at the level of multiple CPU cores. That is, we split a computational problem into smaller subproblems and assigned the subproblems to the CPU cores. However, modern CPU cores can also perform additional parallel operations within a single core.
The CPU core executes a stream of instructions when running a computer program. The instructions can be, for example, to load data from the main memory to a register, to store data from a register to the main memory, or to do some basic arithmetic operation such as addition, subtraction, multiplication, or division.
The instructions are executed at a pace determined by the clock speed. For example, a CPU running at 2.1 GHz completes 2.1 billion clock cycles per second. Instruction-level parallelism (ILP) refers to the ability of the CPU core to execute multiple instructions within one clock cycle. For example, if the program has to perform two additions that are independent of each other, and the core has two floating-point units, the two additions can be executed in parallel. The programmer does not have direct access to this parallelization, but it is possible to arrange the program code in a way that makes it easier for the compiler and the hardware to utilize this type of parallelization.
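As a concrete illustration, the following C sketch (the function names and the unrolling by two are illustrative choices, and the array length n is assumed to be even) contrasts a single dependent chain of additions with two independent accumulators that a core with two floating-point units can update in the same clock cycle:

/* Single dependent chain: every addition needs the previous value of s,
 * so the additions cannot overlap. */
double sum_chain(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

/* Two independent accumulators: the two additions in the loop body do not
 * depend on each other, so the core can execute them in parallel. */
double sum_unrolled(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    return s0 + s1;
}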
Vectorization refers to the ability of the CPU core to perform an operation on multiple data elements simultaneously; this is often called Single Instruction, Multiple Data (SIMD) parallelization. Consider a case where you have a list of eight numbers or, in mathematical terms, a vector with a length of eight, and you would like to add the numbers element by element to those in another similar list.
If the CPU core can perform the addition on four numbers in a single vector instruction, it is said to have a vector length of four floating-point numbers. The whole addition would then be completed in two instructions, and the floating-point performance (FLOP/s) is improved by a factor of four compared to the scalar case.
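The element-wise addition above can be written, for example, as the following C loop (the function name is made up for this example). With a vector length of four, a vectorizing compiler can turn the eight independent scalar additions into two vector instructions:

/* Element-wise addition of two eight-element arrays. */
void add_eight(const float *a, const float *b, float *c)
{
    for (int i = 0; i < 8; i++) {
        c[i] = a[i] + b[i];  /* the iterations are independent of each other */
    }
}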
The programmer can assist vectorization by writing the code in a suitable form and by providing hints that help the compiler vectorize suitable operations.
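As a sketch of such hints (assuming a compiler with OpenMP SIMD support, e.g. GCC or Clang with -fopenmp-simd; the function name is illustrative), the restrict keyword promises that the arrays do not overlap, and the pragma asks the compiler to vectorize the loop:

void add_arrays(const float *restrict a, const float *restrict b,
                float *restrict c, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}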
Practically all CPUs used in modern supercomputers support vector instructions. For example, the current Intel CPUs in 2021 can perform arithmetic operations on eight floating-point numbers simultaneously (or on 16 numbers with reduced precision), and AMD CPUs on four floating-point numbers (or on eight with reduced precision). Both also provide instruction-level parallelism and are capable of performing two additions and two multiplications per clock cycle. Thus, an AMD CPU in the Mahti supercomputer, with 64 cores running at 2.6 GHz, has a theoretical peak performance of about 2660 GFLOP/s:
64 (cores) x 4 (vector length) x (2 + 2) (ILP) x 2.6 GHz ≈ 2660 GFLOP/s