Distributed-memory computer



Because of the technical challenges of fitting very large numbers of CPU cores into a single shared-memory computer, all of today's supercomputers use the same basic approach to building a large system: take lots of separate computers and connect them together with a fast network.




Each separate computer, or node, operates independently and has its own memory, completely separate from that of the other nodes. CPU cores in one node cannot directly access the memory of other nodes; all communication between nodes is done via the network. Each node also runs its own copy of the operating system.
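
To make the picture concrete, here is a minimal sketch of node-to-node communication using MPI, a widely used message-passing library (the example is illustrative and not part of the original course material). Each process has its own private copy of the variable "value"; the only way to share it is to send it explicitly over the network.

    /* Illustrative sketch: two processes exchange one integer over the network. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;   /* exists only in rank 0's private memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 1 received %d over the network\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes (for example, mpirun -np 2); rank 0's data never appears in rank 1's memory until the message arrives.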

Continuing with the office analogy, a distributed-memory computer is like a set of workers distributed among several offices, each with their own whiteboard, communicating with each other by phone. With this analogy, we can clearly see both the advantages and disadvantages of the approach.



Advantages

  • "Infinite" capacity: The number of whiteboards (i.e., the total amount of memory) increases as we add more offices. When a problem is too large to fit in the memory of a node, we just use more nodes so that less memory per node is needed (see the sketch after this list).

  • "Infinite" computing power: The number of workers increases as we add more offices. When a problem takes too long to solve, we just use more nodes to have less work per node, which in practice means more parallelism.

  • No "overcrowding" of the memory bus: As the number of workers per whiteboard is constant, memory access speed does not decrease when adding more nodes.

  • Low cost: It is relatively cheap to increase the memory and computing capacity of the system by adding more nodes.
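
As a rough illustration of the first two advantages, the sketch below (an assumed MPI program, not taken from the course) splits a large problem across all the processes so that each node only has to store, and work on, its own share.

    /* Illustrative sketch: each process stores only its share of a large problem,
     * ignoring any remainder for simplicity. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const long N = 100000000;   /* total problem size (made-up value) */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long local_n = N / size;                                /* this node's share */
        double *local_data = malloc(local_n * sizeof(double));  /* only needs to fit in one node's memory */

        printf("Rank %d of %d stores %ld of %ld points\n", rank, size, local_n, N);

        free(local_data);
        MPI_Finalize();
        return 0;
    }

Doubling the number of nodes halves both the memory needed and the work done per node.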

Limitations

  • Possibly expensive communication: Following our analogy, the more calls are made between the offices, the more time is spent in communication.

  • Communication overheads: Using a phone takes more time than just reading or writing on the whiteboard.

  • Possibly saturated communication network: While two offices are on a call with each other, they cannot be reached by the other offices. Although it is possible to leave a message or use group calls between several offices (see the sketch after this list), there is still a limit on how quickly information can flow between the offices.

  • Data requirements: Data must be split across all the nodes, which can be challenging with large and complex data.

  • System and software requirements: The operating system and all software must be installed and maintained on many nodes. Fortunately, this can largely be automated.
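
The "group call" mentioned above has a direct counterpart in message-passing libraries: a collective operation such as a broadcast. The sketch below (illustrative only, again assuming MPI) shows rank 0 delivering one value to every other process in a single call instead of phoning each office in turn.

    /* Illustrative sketch: a broadcast as a "group call" from rank 0 to everyone. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, message = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            message = 123;   /* initially only rank 0 knows the value */

        /* Every process takes part in the collective; afterwards all hold the value. */
        MPI_Bcast(&message, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("Rank %d now has message = %d\n", rank, message);

        MPI_Finalize();
        return 0;
    }

Even so, the broadcast still has to travel over the network, so the limit on how quickly information can flow between offices remains.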



Despite the limitations, it has turned out to be much easier and cheaper to build networks connecting large numbers of computers than to pack enormous numbers of CPU cores into a single shared-memory computer. This means it is relatively straightforward to build physically huge supercomputers - it is an engineering challenge, but one that computer engineers seem to tackle skillfully!

However, the compromises made in building a supercomputer (many separate computers, each with its own private memory) lead to difficulties on the software side. Having built a supercomputer, we now need to write programs that can take advantage of thousands of CPU cores, and writing such software in the distributed-memory model can be quite challenging.
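
To give a flavour of that challenge, the sketch below (again an assumed MPI program, with made-up variable names) computes a simple global sum: every node must first form its own partial result and then explicitly combine the pieces over the network, bookkeeping that a shared-memory program would not need to spell out.

    /* Illustrative sketch: a global sum in the distributed-memory model. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        long local_sum = 0, global_sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process sums only the numbers from 1 to 1000 assigned to it. */
        for (long i = rank + 1; i <= 1000; i += size)
            local_sum += i;

        /* The partial sums must be combined explicitly; only rank 0 gets the result. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Sum of 1..1000 = %ld\n", global_sum);

        MPI_Finalize();
        return 0;
    }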


Adapted from material in the "Supercomputing" online course (https://www.futurelearn.com/courses/supercomputing/) by the Edinburgh Parallel Computing Centre (EPCC), licensed under Creative Commons BY-SA.