The division is not strict; the Illinois Cedar machine (circa the late 1980s) had local memory attached to each group of eight processors, forming a shared memory component, with each processor able to access a global shared memory as well. Long before multicore systems, many companies provided "distributed shared memory" machines, which present a single address space to the user but have that memory physically distributed. The OS, runtime system, and hardware then handle accessing the right entry, and the user does not have to keep track of which processor "owns" it. These are also called NUMA machines, for Nonuniform Memory Access, since it usually takes longer to access an operand physically residing in a remote memory than in a local one. Another related term is CC-NUMA: cache-coherent NUMA.
Some understanding of basic uniprocessor memory systems is needed. The emphasis on memory systems here follows from the fundamental performance principle of scientific computing: most numerical computations are limited not by processor speed, but by the time required to move data to and from the processor.
An interleaved memory with b banks is said to be b-way interleaved, no big surprise. Memory address m then resides in memory bank mod(m, b). This way, consecutive addresses reside in different banks, so if a program accesses one word after another, the memory system can have the different banks processing the requests simultaneously. The processor can request a transfer from location m on one cycle and from m+1 on the next cycle; the information will be returned on successive cycles. Note that the latency of a request, i.e., the number of cycles a processor must wait before receiving the contents of a single location, is not affected, but the bandwidth is improved. If there are enough banks, the memory system can potentially deliver one word per processor cycle, regardless of the memory cycle time.
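The bank mapping can be sketched as a tiny simulation; the 8-bank configuration and the addresses here are illustrative choices, not taken from any particular machine:

```python
# Sketch of b-way interleaving: word address m lives in bank mod(m, b),
# so a stream of consecutive addresses is spread evenly across all banks.
def bank_of(address, banks):
    """Return the bank holding a given word address."""
    return address % banks

banks = 8  # an 8-way interleaved memory (illustrative)

# Consecutive addresses cycle through all eight banks before repeating,
# so each bank has `banks` cycles to recover before its next request.
stream = [bank_of(m, banks) for m in range(16)]
print(stream)  # each bank appears exactly once per group of 8 addresses
```

With 8 banks, a bank whose cycle time is 8 processor cycles can still keep up with a request stream of consecutive addresses, which is exactly the bandwidth argument above.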
The decision to allocate addresses as contiguous blocks or in interleaved fashion depends on how one expects information to be accessed. Programs are compiled so instructions reside in successive addresses, so there is a high probability that after a processor executes the instruction at location m it will execute the instruction at m+1. Compilers can also allocate vector elements to successive addresses, so operations on entire vectors can take advantage of interleaving. For these reasons, vector processors invariably have interleaved memory. However, shared memory multiprocessors use a block-oriented scheme since memory referencing patterns in an MIMD system are quite different. There the goal is to connect a processor to a single memory and use as much information as possible from that memory before switching to another memory.
The reason for using a cache and cache lines is data locality: if you used the word at location m on one step, the next word you access is likely to have an address near or adjacent to the one just accessed. When the cache is full and a new line is brought in, some line must be removed. The most commonly used replacement policy is LRU: least recently used. The line whose most recent access lies furthest in the past is replaced. Here the idea is based on temporal locality: recently accessed words are likely to be accessed again soon (think of a loop index variable, for example).
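The LRU policy can be sketched with a toy cache model; the four-line capacity and the access sequence are invented for illustration:

```python
from collections import OrderedDict

class LRUCacheLines:
    """Toy model of a cache with LRU replacement: the least recently
    accessed line is evicted when a new line must be brought in."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # line address -> contents (unused here)

    def access(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)    # mark as most recently used
            return True                     # cache hit
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[line] = None
        return False                        # cache miss

cache = LRUCacheLines(capacity=4)
# Re-accessing line 0 refreshes it, so bringing in line 4 evicts line 1,
# the line whose last access lies furthest in the past.
hits = [cache.access(a) for a in [0, 1, 2, 3, 0, 4, 1]]
print(hits)  # [False, False, False, False, True, False, False]
```

The final access to line 1 misses precisely because line 1, not the more recently touched line 0, was the LRU victim.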
A line can be written back to memory in one of two ways. With a write-through policy, every store updates main memory as well as the cache, so memory always holds the current value. With a write-back policy, a modified ("dirty") line is written to memory only when it is evicted, which reduces memory traffic but means memory can be temporarily out of date.
Modern processors use multiple levels of cache; three is common and four increasingly so. This trend is helpful for serial computing - otherwise the vendors would not build them. However, this "deep memory hierarchy" has some serious consequences for parallel computing. Essentially, when processors need to communicate by sending a message or datum from one to the other, that datum must burrow its way upwards through the first processor's memory hierarchy, then downwards through the second processor's memory hierarchy, before it can be used by the second processor. So in addition to the cost of sending the data across whatever communication substratum exists, there is the cost of traversing two memory systems - and perturbing data in the caches along the way.
The problem with this design is that processors contend for access to the bus. If processor P is fetching an instruction, all other processors must wait until the bus is free. If there are only two processors, they can run close to their maximum rate, since the bus can alternate between them: while one processor is decoding and executing an instruction, the other can be using the bus to fetch its next instruction. However, when a third processor is added, performance begins to degrade. Usually around ten processors on the bus flatten the performance curve, so that adding more processors does not increase performance. The memory and bus have a fixed bandwidth, determined by a combination of the memory cycle time and the bus protocol, and in a single-bus multiprocessor this bandwidth is divided among several processors. If the processor cycle time is slow compared to the memory cycle, a fairly large number of processors can be accommodated by this plan, but since processor cycles are usually faster than memory cycles, this scheme is not scalable.
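A back-of-envelope model shows why the curve flattens; the numbers below are illustrative, not measurements, and the model deliberately ignores caching and bus arbitration:

```python
def saturation_point(mem_cycle_ns, proc_cycle_ns, cycles_per_access):
    """Crude model of a single shared bus: the bus supplies one word per
    memory cycle, and each processor requests one word every
    `cycles_per_access` processor cycles. Returns the number of
    processors at which aggregate demand equals bus bandwidth."""
    bus_words_per_ns = 1.0 / mem_cycle_ns
    demand_per_proc = 1.0 / (proc_cycle_ns * cycles_per_access)
    return bus_words_per_ns / demand_per_proc

# Processors as slow as memory: the bus can serve quite a few of them.
slow = saturation_point(mem_cycle_ns=100, proc_cycle_ns=100, cycles_per_access=10)
# Processors ten times faster than memory: the same bus saturates almost
# immediately -- the non-scalability argument in the text.
fast = saturation_point(mem_cycle_ns=100, proc_cycle_ns=10, cycles_per_access=10)
print(slow, fast)
```

Under these made-up parameters the slow-processor bus supports about ten processors while the fast-processor bus saturates with one, matching the qualitative claim above.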
A modification to this design improves performance, but it cannot indefinitely postpone the flattening of the performance curve. If each processor has its own local cache and the program has good data locality, then the data a processor needs is likely to be in its cache. A good cache hit rate greatly reduces the number of bus accesses a processor makes and thus improves overall efficiency. The dogleg of the performance curve, which identifies the point up to which it is still cost-effective to add processors, extends to around 20 processors, and the curve does not flatten out until around 30 processors.
Giving each processor its own cache introduces the cache coherency problem. Suppose two processors use data item A, so A ends up in the cache of both processors. Next suppose processor 1 performs a calculation that changes A. When it is done, the new value of A is written out to main memory. At a later time, processor 2 needs to fetch A. However, since A was already in its cache, it will use the cached value and not the newly updated value calculated by processor 1. Maintaining a consistent version of shared data requires providing new versions of the cached data to each processor whenever one of the processors updates its copy. The typical approach is called a "snooping protocol", where each processor "listens" on the bus for address requests and update postings.
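The stale-data scenario and the snooping fix can be sketched as a toy simulation; write-invalidate is one common snooping variant, and all the class and variable names here are invented for illustration:

```python
class SnoopingCache:
    """Toy write-invalidate snooping cache: on any write, the writer
    broadcasts the address on the bus, and every other cache drops
    (invalidates) its copy, forcing a fresh fetch from memory."""
    def __init__(self, bus, memory):
        self.data = {}
        self.bus = bus
        self.memory = memory
        bus.append(self)  # start "listening" on the shared bus

    def read(self, addr):
        if addr not in self.data:          # miss: fetch from main memory
            self.data[addr] = self.memory[addr]
        return self.data[addr]

    def write(self, addr, value):
        self.data[addr] = value
        self.memory[addr] = value          # write-through to main memory
        for cache in self.bus:             # snoop: invalidate other copies
            if cache is not self:
                cache.data.pop(addr, None)

memory = {"A": 1}
bus = []
p1 = SnoopingCache(bus, memory)
p2 = SnoopingCache(bus, memory)

p1.read("A")
p2.read("A")            # both caches now hold A = 1
p1.write("A", 42)       # p1 updates A; p2's stale copy is invalidated
print(p2.read("A"))     # p2 misses and refetches 42, not the stale 1
```

Without the invalidation loop in `write`, the final read would return the stale value 1, which is exactly the coherency failure described above.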
Another way of building a shared memory multiprocessor is to replace the bus with a switch that routes requests from a processor to one of several different memory modules. Even though there are several physical memories, there is one large virtual address space. The advantage of this organization is based on having switches that can handle multiple requests in parallel. Each processor can be paired up with a memory, and each can then run at full speed as it accesses the memory it is currently connected to. Contention still occurs, since if two processors make requests of the same memory module, only one will be given access and the other will be blocked.
Various switch designs include crossbars, which connect every processor directly to every memory module, and multistage networks such as the omega network, which route requests through stages of smaller switches.
As the material on interconnection networks between processors and memory shows, the problems of bandwidth limits and network congestion can be alleviated by giving each processor a large cache - at the price of worrying about cache coherency. If this idea is carried to its extreme, all of the memory becomes local to the processors. This gives a distributed memory system, one where each processor has its own memory - and its own address space. Now the programmer is required to explicitly distribute the program data among the processors, synchronize between them, and communicate results between the processors by sending messages.
The advantage of distributed memory systems is that they are more "scalable". The word scalable is bandied about a great deal in parallel computing, but it is like the word "soup" - it means drastically different things to different people at different times. Here scalability is primarily of an architectural variety: distributed memory machines consist of fungible components, and you buy more and plug them in as needed. With a suitable problem and code, their performance can also be scalable. But using distributed memory introduces two sources of overhead: it takes time to construct and send a message from one processor to another, and a receiving processor must be interrupted to deal with messages from other processors.
So in a distributed memory system the memory is associated with individual processors and a processor is only able to address its own memory. Sometimes this is called a multicomputer system, since the building blocks in the system are themselves small computer systems complete with processor and memory. The IBM SP/2, for example, was originally just a collection of RS/6000 workstations tied together with a fast interconnect network, with each RS/6000 running its own copy of the OS.
In a distributed memory system, each processor can utilize the full bandwidth to its own local memory without interference from other processors. There is no inherent limit to the number of processors as with bus-based systems. The size of the system is now constrained only by the network used to connect processors to each other. There are no cache coherency problems (more accurately, the user becomes responsible for maintaining coherency). Each processor is in charge of its own data, and other processors cannot access it without going through explicit actions commanded by the program.
Programming on a distributed memory machine means organizing your program as a set of independent tasks that communicate with each other via messages. The programmer must be aware of where data is stored (that is, on which processor it resides), which introduces a new form of locality in algorithm design. An algorithm that allows data to be partitioned into discrete units and then runs with minimal communication between units will be more efficient than an algorithm that requires random access to global structures.
Semaphores, monitors, and other concurrent programming techniques are not directly applicable on distributed memory machines, but they can be implemented by a layered software approach. User code can invoke a semaphore, for example, which is itself implemented by passing a message to the node that ``owns'' the semaphore. This approach is not efficient.
Which programming style is easier - shared memory with semaphores and the like, or distributed memory with message passing - is often a matter of background; however, most users find the shared memory model easier to deal with. The message passing style can fit well with an object-oriented programming methodology, and if a program is already organized in terms of objects it may be quite easy to adapt it for a distributed memory system. Choosing to implement a program in shared memory versus distributed memory is usually based on the amount of information that must be shared by parallel tasks. Whatever information is shared among tasks must be copied from one node to another via messages in a distributed memory system, and this overhead may reduce efficiency to the point where a shared memory system is preferred.
Single nodes in a distributed memory system are called processing elements, or PEs. To any PE, the other PEs are simply I/O devices. To send a message to another PE, a processor copies information into a buffer in its local memory and then tells its local controller to transfer the information to an external device, much the same way a disk controller in a microcomputer would write a block on a disk drive. In this case, however, the block of data is transferred over the interconnection network to an I/O controller in the receiving node. That controller finds room for the incoming message in its local memory and then notifies the processor that a message has arrived. On the Intel Paragon, to avoid tying up computation while communication was going on, each PE contained two i860 processors: one handled communication and the other computation, allowing the two to overlap. Modern multicore systems have made that design clunky and unnecessary, however.
As always happens in computer science when there are two paradigms, each with complementary strengths, hybridization efforts try to build systems that have the strengths of both. A blurring of the distinction between shared and distributed memory systems has been going on recently, and takes at least three forms: