Thus far, we have used the term memory when referring to the volatile storage attached to the CPU and shared between the CPU cores within a supercomputer node. In practice, however, there is a hierarchy of different levels of memory, and the hierarchy sometimes also includes the memory of other nodes, which cannot be accessed directly.
When a CPU core calculates 2.1 + 4.3, it first fetches the two numbers from the main memory to registers in the CPU core. The result also appears first in a register, and the value is then pushed from the register to the main memory. As we have discussed earlier, the speed of accessing memory can be a performance bottleneck. In order to alleviate that, modern CPUs both in consumer devices and supercomputers have memory caches.
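As a rough sketch in C (the exact instruction sequence depends on the compiler and the CPU, and an optimizing compiler may even keep all three values in registers throughout), this load-compute-store pattern looks as follows:

    #include <stdio.h>

    int main(void)
    {
        double a = 2.1;       /* the operands start out in main memory   */
        double b = 4.3;       /* (or, in practice, in a cache)           */
        double c = a + b;     /* roughly: load a and b into registers,   */
                              /* add them inside the CPU core, and       */
                              /* store the result back from a register   */
        printf("%f\n", c);
        return 0;
    }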
A memory cache is basically a small amount of very fast scratchpad memory close to the CPU core. The basic idea is that when a CPU core needs to load something from the main memory, it first looks in the cache. If the data is already there, it can be fetched faster from the cache to the register than from the main memory. Also, when a result of a computation is stored, it is first moved from the register into the cache and only copied to the main memory later, when the cache becomes full.
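The lookup order can be made concrete with a toy model. This is purely illustrative: a real cache is implemented in hardware, organized into lines and sets, and also handles the write path described above.

    #include <stdio.h>

    /* A toy model of a single cache level with 8 entries. */
    #define CACHE_ENTRIES 8

    struct entry { long address; double value; int valid; };
    struct entry cache[CACHE_ENTRIES];

    double main_memory[1024];      /* stands in for the slow main memory */

    double load(long address)
    {
        /* 1. Look in the cache first. */
        for (int i = 0; i < CACHE_ENTRIES; i++)
            if (cache[i].valid && cache[i].address == address)
                return cache[i].value;          /* fast path: cache hit */

        /* 2. Cache miss: fetch from main memory and keep a copy. */
        double v = main_memory[address];        /* slow access */
        struct entry e = { address, v, 1 };
        cache[address % CACHE_ENTRIES] = e;     /* crude replacement policy */
        return v;
    }

    int main(void)
    {
        main_memory[42] = 6.4;
        load(42);                     /* miss: goes to main memory */
        printf("%f\n", load(42));     /* hit: served from the cache */
        return 0;
    }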
Think again of the analogy with workers in an office with a whiteboard. The whiteboard is the main memory, and the workers are now doing their computations at their desks. Every time a worker reads from or writes to the whiteboard, they need to leave the desk. Imagine now that each worker has a small notebook: when data needs to be read from the whiteboard, the notebook is filled with everything needed in order to work happily at the desk for a longer period of time. Especially when performing many computations with the same data, the notebook, or cache, can significantly speed up the overall memory access.
Caches work splendidly for multiple workers as long as they only read data. Unfortunately, real programs also write data, which in our analogy means that the workers want to modify the data on the whiteboard as well. If two people are working on the same data simultaneously, a problem arises: if one worker changes a number in their notebook, the other workers need to be informed about it. The compromise solution is to let everyone know whenever a value in a notebook is modified. Whenever altering a number, one has to, for example, shout out:
“I’ve just changed the entry for the 231st salary - if you have a copy of it, then you’ll need to get the new value from me!”
Although this is OK for a small number of workers, it clearly is problematic when there are lots of them. For example, imagine 100 workers: when changing a number, 99 other people need to be informed about it, which leads to wasting time. Worse, workers must constantly be on the lookout for updates from 99 other workers instead of concentrating on their own calculations.
This is the fundamental dilemma: memory access is so slow that small and fast caches are needed to supply data as fast as it can be processed. However, whenever data is written, the bookkeeping overhead grows with the number of CPU cores and eventually slows everything down again.
Keeping the data consistent and up to date on all the CPU cores is called cache coherency. In our office analogy, this means that we always have up-to-date values in our notebook or, at the very least, know when our notebook is out of date and needs to be refreshed from the whiteboard. Ensuring cache coherency is a major obstacle in building large multicore processors. Furthermore, in order to improve the memory access speed, most modern CPUs have not only one but multiple levels of cache.
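The cost of this coherency traffic can be demonstrated with so-called false sharing. In the sketch below (using OpenMP; the 64-byte cache line size is an assumption about typical hardware), two threads update two separate counters, yet the first version is typically much slower because the counters happen to share a cache line and the cores keep invalidating each other's copies:

    #include <stdio.h>
    #include <omp.h>

    #define N 100000000L

    /* Two counters in the same cache line: every write by one thread
     * invalidates the other thread's cached copy of the line. */
    struct { volatile long a, b; } same_line;

    /* Padded so that each counter sits in its own cache line
     * (64 bytes is a typical line size, but it is hardware dependent). */
    struct { volatile long a; char pad[64]; volatile long b; } padded;

    int main(void)
    {
        double t = omp_get_wtime();
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            for (long i = 0; i < N; i++) same_line.a++;
            #pragma omp section
            for (long i = 0; i < N; i++) same_line.b++;
        }
        printf("same cache line: %.2f s\n", omp_get_wtime() - t);

        t = omp_get_wtime();
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            for (long i = 0; i < N; i++) padded.a++;
            #pragma omp section
            for (long i = 0; i < N; i++) padded.b++;
        }
        printf("separate lines:  %.2f s\n", omp_get_wtime() - t);
        return 0;
    }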
The hardware handles moving data in and out of the caches, and programmers cannot control it directly. However, the way data and computations are organized in the program code can affect how efficiently the caches are utilized. For example, when simulating particles in three dimensions, the coordinates of the particles can be stored either in three lists of N numbers (each list containing the x, y, or z coordinates of the N particles) or in N lists of three numbers (each list containing the three coordinates of a single particle). Depending on the computations performed with the coordinates, one layout may use the caches far more efficiently than the other, as sketched below.
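As a sketch in C (the names are illustrative), the two layouts and a loop that favors the first one look like this:

    #include <stdio.h>

    #define N 1000000

    /* Layout 1: three lists of N numbers ("structure of arrays"). */
    struct { double x[N], y[N], z[N]; } soa;

    /* Layout 2: N lists of three numbers ("array of structures"). */
    struct particle { double x, y, z; };
    struct particle aos[N];

    int main(void)
    {
        /* Touching only the x coordinates: with layout 1 every cache
         * line brought in is full of useful numbers ... */
        double sum = 0.0;
        for (long i = 0; i < N; i++) sum += soa.x[i];

        /* ... whereas with layout 2 the same loop drags the unused
         * y and z values into the cache as well. */
        for (long i = 0; i < N; i++) sum += aos[i].x;

        printf("%f\n", sum);
        return 0;
    }

Conversely, if every computation uses all three coordinates of a particle at once, the second layout keeps them together in a single cache line.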
The memory levels form a pyramid, with the fast but small registers at the top and the slow but large disks at the bottom. The levels have the following characteristics:
Physical location: The higher the memory type is in the pyramid, the closer it is physically to the CPU core.
Performance: As we move towards the top of the pyramid, transfer speed (MB/s) increases.
Access time: The time between read/write requests decreases as we move up in the hierarchy.
Capacity: The amount of information that the memory type can store increases towards the bottom.
Cost per byte: The higher in the pyramid, the more costly the memory is per byte; for example, the main memory is more expensive than the disk.
Registers are the fastest type of memory. All the arithmetic operations on data are performed in registers. CPUs have general-purpose registers as well as registers for specific types of data and operations. Physically, registers are a part of the CPU core.
L1, L2, and L3 caches are intermediate caches between the main memory and the registers. The L1 cache is the smallest and fastest, while L3 is the largest and slowest. When a CPU core needs to fetch data, it looks first in L1, then L2, and finally L3, before turning to the main memory if necessary. Each core often has its own L1 and L2 caches, while the L3 cache may be shared between several cores. As an example, the 64-core AMD Rome processors in CSC’s Mahti supercomputer have the following characteristics (a sketch for observing these sizes from software follows the list):
L1 cache: 32 KiB (+ 32 KiB for instructions), private to the core.
L2 cache: 512 KiB, private to the core.
L3 cache: 16 MiB, shared between four cores.
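These sizes can be observed indirectly from software. The rough sketch below times the traversal of arrays of growing size; on a processor like the one above, the time per element should step up around 32 KiB, 512 KiB, and 16 MiB as the working set falls out of L1, L2, and L3 (hardware prefetching makes the steps less dramatic than the raw latencies would suggest):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        /* Array sizes from 8 KiB up to 32 MiB, doubling each time. */
        for (long n = 1024; n <= (1L << 22); n *= 2) {
            double *a = malloc(n * sizeof(double));
            for (long i = 0; i < n; i++) a[i] = 1.0;

            long reps = (1L << 26) / n;        /* keep total work constant */
            volatile double sum = 0.0;         /* volatile: keep the loop alive */
            clock_t t0 = clock();
            for (long r = 0; r < reps; r++)
                for (long i = 0; i < n; i++)
                    sum += a[i];
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

            printf("%6ld KiB: %5.2f ns per element\n",
                   (long)(n * sizeof(double) / 1024),
                   1e9 * secs / ((double)reps * n));
            free(a);
        }
        return 0;
    }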
Main memory is the memory within a node in which all instructions and data of active programs are stored.
Remote memory is the main memory in another node. Accessing remote memory requires communication via the interconnect.
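With MPI, the standard way of programming across nodes, this communication is explicit. A minimal sketch, to be run with the two ranks placed on different nodes:

    #include <stdio.h>
    #include <mpi.h>

    /* Rank 1 holds a value in its main memory, which is remote from
     * rank 0's point of view; rank 0 can only obtain it as a message
     * passed through the interconnect. */
    int main(int argc, char **argv)
    {
        int rank;
        double value = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            value = 42.0;                  /* lives in rank 1's memory */
            MPI_Send(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %.1f from the remote node\n", value);
        }

        MPI_Finalize();
        return 0;
    }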
Disks can store data after a program has ended or the computer has shut down, unlike all the other types of memory discussed here. As accessing a disk is slow, it is usually used only when opening a program (to load the data) and again when closing it (to save the data). Sometimes a small amount of logging data is written during the run of the program in order to allow restarting in case of unexpected crashes, and some checkpoint data might also be written during the runtime, as sketched below.
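A minimal sketch of such checkpointing (the file name and the state are purely illustrative): the program state is written to disk every so many steps, and rarely, precisely because the disk is slow:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000

    double state[N];            /* hypothetical simulation state */

    /* Write the current step and state to disk so that a crashed run
     * can be restarted from the latest checkpoint. */
    void checkpoint(long step)
    {
        FILE *f = fopen("checkpoint.dat", "wb");  /* illustrative file name */
        if (f == NULL) { perror("checkpoint"); exit(1); }
        fwrite(&step, sizeof step, 1, f);
        fwrite(state, sizeof(double), N, f);
        fclose(f);
    }

    int main(void)
    {
        for (long step = 0; step < 100000; step++) {
            /* ... advance the simulation one step ... */
            if (step % 1000 == 0)
                checkpoint(step);    /* disk is slow, so do this rarely */
        }
        return 0;
    }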
Includes material from the "Supercomputing" online course (https://www.futurelearn.com/courses/supercomputing/) by EPCC (Edinburgh Parallel Computing Centre), licensed under Creative Commons CC BY-SA.