An easy-to-use graphical user interface is an important element of desktop applications, whereas many scientific applications, especially non-commercial ones, might not have one at all. Also, in a desktop application, whether an operation takes 0.01 seconds or 0.10 seconds is most likely irrelevant to the user, but if the operation is executed ten million times in a scientific application, the difference becomes ten days (0.09 s × 10,000,000 = 900,000 s ≈ 10.4 days)! Thus, good performance is often much more critical for supercomputer applications.
Another significant difference is that desktops have only a few CPU cores, compared to the hundreds of thousands in the biggest supercomputers. Therefore, very different approaches to parallel programming are needed to fully utilize the hardware.
One of the most important aspects of HPC programming is, of course, the performance of the code, i.e., how fast it runs and solves a given problem. As described earlier, a massively parallel computer is a highly complex machine, and writing code that extracts the best possible performance from the system is equally complicated. For example, the programmer has to consider, among other things:

- parallelization and load balancing: making sure that all the parallel tasks have an equal amount of work;
- data structures and memory access patterns: loading data from contiguous memory locations and reusing it is good for performance (see the sketch below);
- vectorization of calculations: utilizing the parallel computing capacity of a single core;
- communication between the parallel tasks: communication overhead typically stops parallel performance improvement eventually.
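To make the point about memory access patterns concrete, below is a minimal, illustrative C sketch (not from any particular application). In C, two-dimensional arrays are stored row by row, so the loop order determines whether memory is read contiguously:

```c
#include <stddef.h>

#define N 1024

/* Cache-friendly: the inner loop walks contiguous elements of each row. */
double sum_rowwise(const double a[N][N])
{
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Cache-unfriendly: the inner loop jumps N elements between accesses,
 * so almost every access misses the cache. */
double sum_columnwise(const double a[N][N])
{
    double sum = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions perform exactly the same arithmetic, yet on typical hardware the row-wise version can be several times faster.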
However, merely timing the code makes it hard to say anything about why it performs as it does. Instead, one can use various performance analysis tools to find out where the problems lie and how the program's performance can be improved; these include, for example, profilers, which show where the execution time is spent, and tracing tools, which record the communication between parallel tasks.
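For reference, the kind of manual timing mentioned above might look like the following sketch (compute() is a hypothetical stand-in for the work being measured); it reports how long a region takes, but nothing about why:

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical placeholder for the routine being measured. */
void compute(void) { /* ... */ }

int main(void)
{
    struct timespec t0, t1;

    /* CLOCK_MONOTONIC gives wall-clock time unaffected by clock changes. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    compute();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("compute() took %.6f s\n", elapsed);
    return 0;
}
```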
Naturally, debuggers (tools for finding programming errors) are also very important, but to be useful they need to be able to handle parallel programs efficiently.
Portability is another concept that needs to be considered when designing software. It addresses whether, and how easily, a program can be transferred from one system to another. In addition to differing hardware (most notably, CPUs and GPUs), systems may have different compilers and libraries. In an ideal world, a program could be compiled and run on any system with good performance and without any modifications. In practice, however, this is not the case: sometimes the source code needs to be modified, sometimes the program compiles but crashes when run, and sometimes the performance is insufficient.
CPU programs written in Fortran, C/C++, or Python and parallelized with MPI or OpenMP are generally portable with good performance (see the sketch below). The GPU world, however, is currently more complicated: the field is still relatively new and developing, there are a number of competing vendors, and proprietary languages and programming models are often designed by the vendors for their specific hardware.
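As an illustration of such a portable CPU program, the following minimal MPI + OpenMP sketch compiles and runs unchanged on essentially any system that provides an MPI library and an OpenMP-capable compiler (e.g., with mpicc -fopenmp):

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each MPI process runs a team of OpenMP threads on its own cores. */
    #pragma omp parallel
    printf("Rank %d of %d, thread %d of %d\n",
           rank, size, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```

It could be launched with, e.g., mpirun -np 4 ./a.out, with the number of threads per process controlled by the OMP_NUM_THREADS environment variable.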
While scientific computing programs solve or simulate a vast number of different problems, many of them perform similar subtasks of numerical mathematics, such as solving systems of linear equations and eigenvalue problems, or computing various matrix operations, fast Fourier transforms, and so on. To make the software developer's task more straightforward, the most common algorithms and operations are available as performance-optimized subroutine libraries that can be linked to the main program.
In addition to providing convenient tools for common tasks, such libraries typically offer significantly better performance than naïve implementations. Getting everything out of the CPU can require tedious optimizations, which in many cases means different approaches depending on the underlying hardware. These high-performance libraries are often the product of decades of development, and they are maintained to keep up with the latest hardware and with advances in languages and compilers.
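As a sketch of the difference, consider matrix multiplication. The naïve triple loop below is easy to write; the single call to the CBLAS interface (provided by implementations such as OpenBLAS or Intel MKL; the exact linking flags vary by system) delivers the cache blocking, vectorization, and threading described above:

```c
#include <cblas.h>

/* Naïve C = A * B for n-by-n row-major matrices; correct but slow,
 * as it leaves most of the CPU's capability unused. */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* The same operation, C = 1.0 * A * B + 0.0 * C, via an optimized
 * BLAS routine tuned for the underlying hardware. */
void matmul_blas(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}
```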
Readily available libraries for common methods give a developer or scientist the freedom not to reinvent the wheel. This saves plenty of time and allows more thought to be put into the actual science. Writing your own implementation of a generic method or algorithm often leads to poor, inefficient code and, even worse, wasted time. It is also worth noting that general, standardized packages that are maintained across advances in technology are an excellent way to uphold portability. However, this is a double-edged sword: obscure, less-known packages may add dependencies that are not met in all supercomputing environments.