Basic MPI
MPI (Message Passing Interface) is a standard interface for the
message passing paradigm of parallel computing. This is
a model of cooperative processes, working on separate data spaces
and exchanging messages when they need to share or communicate
data. The model implies an active role on the part of both the
sending process and the receiving process.
The machine model underlying MPI is that of a multicomputer:
a distributed memory machine with processors tied together via some
kind of interconnection fabric. The multicomputer model ignores
the underlying topology of the fabric, and presumes there are two
types of memory accesses: local and distant. However, it is possible
to impose a topology model on MPI, an advanced topic reserved for later.
Although MPI has an underlying distributed memory model, it can be
used for
- Distributed memory machines
- Shared memory machines
- Arrays of SMPs ("clusters")
- Networks of workstations
- Heterogeneous networks of machines
This is because the logical programming model does not have to
match the physical machine architecture. Generally you can get
more efficient code if your programming model matches the machine's
physical hardware, but the advantage of MPI is that by making only
minimal assumptions about the hardware, your code will be
portable across the wide variety of actual systems listed above.
Ancient history
In the 1980's, distributed memory machine manufacturers all developed message passing
libraries for their machines - it was the only effective way to use them. However,
each company had its own model of message passing and different bindings to standard
languages such as C and Fortran. Codes running on one machine could only be ported to
another with great effort, and often it was not just a matter of translating one
message passing function call into another since the underlying models differed.
However, some general forms emerged. Most had send-receive primitives for
point-to-point communication where one processor sends a message and another receives
it. (The other flavor of communication is collective: an entire set of
processors is involved, e.g. in a broadcast message from a processor to all other
processors).
A send primitive typically had the format
send(address, length, destination, id).
Here
- address gives the memory address for the beginning location
of the message in local memory.
- length gives the length of the message (typically in bytes).
- destination is an identifier indicating which other processor
should receive the message.
- id is a message identifier, needed in case multiple messages
get sent from one process to another.
The receive operation was similar:
receive(address, buffsize, source, id, length), where
- address gives the memory address for the beginning location
of the buffer in local memory to place the incoming message
- buffsize gives the maximum size of the buffer
- length gives the actual length of the message (typically in bytes).
- source is an identifier indicating from which other processor
the message is coming
- id is the message identifier used by the corresponding send
This approach can handle virtually all that is needed by a distributed memory
model of computing, but has shortcomings. Messages of noncontiguous data
and sophisticated data objects also need to be communicated. Partitioning
the processes into groups that work on heterogeneous parts of the code is
needed in some applications. Finally, for extremely large scale computations
you want to tie together machines with different architectures, possibly
with different internal binary representations.
The few portable message passing systems were mostly university or national lab
research projects, and were incomplete, lacked vendor support, and were inefficient
since they introduced another layer on top of the machines' "native" message passing
libraries. Out of those, the only real survivor is PVM (Parallel Virtual Machine),
partly because it was the first that tried to get large scale vendor support.
Appearance of MPI
MPI is an industry standard, prompted by recognition on the part of
parallel system purchasers that code development was not cost effective
on those machines. Typically it took three or more years to
port and validate a major engineering or scientific code - but the parallel
systems became outdated every two years.
An MPI program consists of multiple processes, each with its own
address space. Each process runs the same program (SPMD model), but
has a unique number that identifies it. If there are
p MPI processes participating in a single program, a process's
identifying number is an integer between 0 and
p-1 and is called its "rank". In the statement of most
SPMD algorithms, there are lines like
if (myid == 0) { ... }
else { ... }
which is read "if my process identification number is zero, then do the
following; otherwise, do something else". This is how MIMD programs
are built on an SPMD model - each process runs the same program, which branches
depending on the process's ID number. In MPI, that ID number is called
its rank.
Note that most distributed memory machines can be run directly
in MIMD mode with each processor actually running a completely different
program. However, SPMD is the model most often used to emulate
MIMD actions. The reasons are psychological - it is easier to
have a single source code to write and examine.
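As a concrete sketch of this branching (the MPI calls used here are all
described later in this section; the printed messages are just for
illustration), an SPMD program in C might look like:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myid, p;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);   /* my rank */
    MPI_Comm_size(MPI_COMM_WORLD, &p);      /* number of processes */

    if (myid == 0) {
        /* only the rank-0 process executes this branch */
        printf("Process 0 of %d: doing the coordinating work\n", p);
    } else {
        /* every other process executes this branch */
        printf("Process %d of %d: doing a worker's share\n", myid, p);
    }

    MPI_Finalize();
    return 0;
}

Every process runs the same executable, but the branch on myid makes
different ranks do different things.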
Each message in MPI consists of the data sent and a
header. The header contains
- The rank of the sender
- The rank of the receiver
- A message identifier number called its tag
- A communicator identification
The MPI standard guarantees that the integers 0-32767 can be used as
valid tag numbers, but most implementations allow far more.
One basic concept in MPI is that of a
communicator group: a set of MPI processes that work together on a
problem and can send messages to each other.
To start, we will use the default communicator group
MPI_COMM_WORLD, which sets a single context and involves all the
processes running. This is a predefined communicator, of type
MPI_Comm. Later we will cover more details about the concepts of
communicators and contexts. But to get an idea of why different communicator
groups may be needed in a single program, consider what happens if we are
running an MPI program that calls a math library - which was also built to
use parallelism via MPI. To keep process number 3 in our program from getting
confused with process number 3 as defined by the library, we need
an additional identifier to distinguish them (and keep one from receiving
a message intended for the other). This additional identification is the
communicator group.
Although MPI has over 120 different functions that can be invoked, all
parallel programs can be built using just six:
- MPI_INIT() initializes MPI in a program.
- MPI_COMM_SIZE() returns the number of cooperating processes.
- MPI_COMM_RANK() returns the process identifier for the process that invokes it.
- MPI_SEND() sends a message.
- MPI_RECV() receives a message.
- MPI_FINALIZE() cleans up and terminates MPI.
For our purposes, there are a few more that are useful right from the start:
- MPI_BCAST(): send a message from one processor to all the others in the
specified communicator group.
- MPI_ALLREDUCE(): perform a reduction operation, and make the
reduced scalar available to all participating processes.
The last one is useful for most dot products, since it is typically the case
that the resulting scalar is needed by all the processors.
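As a sketch of how such a dot product looks in C (the calling sequences are
covered in detail below; the slice length n_local and the dummy data are made
up for this example), each process computes a partial sum over its own piece
of the vectors, and MPI_Allreduce combines the pieces:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* hypothetical local slices of two distributed vectors */
    int n_local = 4;
    double x_local[4], y_local[4];
    double local = 0.0, global;

    for (int i = 0; i < n_local; i++) {
        x_local[i] = 1.0;                    /* dummy data */
        y_local[i] = (double) myrank;
        local += x_local[i] * y_local[i];    /* partial dot product */
    }

    /* sum the partial results and give the total to every process */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("Process %d: global dot product = %f\n", myrank, global);

    MPI_Finalize();
    return 0;
}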
For performance evaluation, we can also use:
- MPI_WTIME(): returns a double giving the number of seconds elapsed since
some fixed time in the past (for example, the beginning of the program or
1 January 1970, depending on the implementation).
- MPI_WTICK(): returns a double that gives the resolution of MPI_WTIME(), in seconds.
We will not use all of them immediately.
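As a quick sketch of how the timing calls are typically used (the timed region
here is empty; in a real program it would contain the work you want to measure):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();   /* start of timed region */
    /* ... work to be timed goes here ... */
    double t1 = MPI_Wtime();   /* end of timed region */

    printf("elapsed: %f seconds (clock resolution %g seconds)\n",
           t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}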
Here are some details about the ones
we need right away; this is for the C language versions.
- MPI_Init(&argc, &argv); This must be called at the beginning
of the parallel program, after whatever global initializations need
to be performed. Its two arguments are the addresses of the two arguments
that a C main() takes (so that they can be passed along to all the
MPI processes).
- MPI_Comm_size(MPI_Comm comm, &p); This takes the communicator group
as first argument (which will be MPI_COMM_WORLD for all our beginning
programs). It returns in the second argument the number of processes
participating in the communicator group.
- MPI_Comm_rank(MPI_Comm comm, &myrank); This takes the communicator group
as first argument and returns the rank of the calling process.
- MPI_Finalize(); Takes no arguments; it just cleans things up. Try leaving
it out of your program on the burrow to see what happens.
- MPI_Send(void* message, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm);
- MPI_Recv(void* message, int count, MPI_Datatype datatype, int src, int tag,
MPI_Comm comm, MPI_Status *status);
The last two are the most basic send/receive pair, and their required arguments are
- void* message: The beginning address of the block of memory containing the message.
- MPI_Datatype datatype: This is one of the allowed types, generally corresponding
to a C datatype. For example, these include (with C type in parentheses)
- MPI_CHAR (signed char)
- MPI_INT (signed int)
- MPI_DOUBLE (double)
- MPI_FLOAT (float)
- int count: The message consists of "count" items of the given datatype.
- int src: the source of the message (part of the MPI_Recv calling sequence)
- int dest: the destination for the message (part of the MPI_Send calling sequence)
- int tag: the tag for the message.
Note that there is no separate argument giving the "buffsize" as in the generic
send/receive functions described earlier; in MPI_Recv the count argument serves
as the maximum number of items the receiving buffer can hold. If the sent message
is too large to fit into the receiving buffer, the result is either a segmentation
fault (the best case) or weird corruption of your data.
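Putting these pieces together, here is a minimal sketch of a complete program
built from the six basic functions; the tag value and the payload are arbitrary
choices for illustration:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myrank, p, tag = 17;
    double value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        /* rank 0 sends one double to each of the other ranks */
        for (int dest = 1; dest < p; dest++) {
            value = 3.14 * dest;   /* dummy payload */
            MPI_Send(&value, 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
        }
    } else {
        /* every other rank receives one double from rank 0 */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
        printf("Process %d received %f\n", myrank, value);
    }

    MPI_Finalize();
    return 0;
}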
The final argument of the MPI_Recv function gives information about
the message as actually received. It is a C structure with at least
three fields, MPI_SOURCE, MPI_TAG, and MPI_ERROR, giving the source,
the tag, and an error code.
So if, for example, the receive used the wild-card MPI_ANY_SOURCE for its
source argument, then status->MPI_SOURCE will contain the rank of the
process that actually sent the received message.
Note that the MPI_Status variable does not necessarily have a field
for the count of data items actually received; you should use the function
MPI_Get_count() for that.
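Here is a sketch of how the status fields and MPI_Get_count() are typically
used together; the tag, the message lengths, and the buffer size are made up
for this example, and it assumes fewer than 64 processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myrank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (myrank != 0) {
        /* each worker sends "myrank" integers to rank 0 */
        int data[64];
        for (int i = 0; i < myrank; i++) data[i] = myrank;
        MPI_Send(data, myrank, MPI_INT, 0, 99, MPI_COMM_WORLD);
    } else {
        /* rank 0 receives from the workers in whatever order they arrive */
        int buffer[64], nreceived;
        MPI_Status status;
        for (int k = 1; k < p; k++) {
            MPI_Recv(buffer, 64, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &nreceived);
            printf("got %d ints from process %d (tag %d)\n",
                   nreceived, status.MPI_SOURCE, status.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}

Since status is declared here as a structure rather than passed around as a
pointer, its fields are accessed with "." instead of "->".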