Basic MPI
MPI (Message Passing Interface) is a standard interface for the
message passing paradigm of parallel computing. This is
a model of cooperative processes, working on separate data spaces
and exchanging messages when they need to share or communicate
data. The model implies an active role on the part of both the
sending process and the receiving process.
The machine model underlying MPI is that of a multicomputer:
a distributed memory machine with processors tied together via some
kind of interconnection fabric. The multicomputer model ignores
the underlying topology of the fabric, and presumes there are two
types of memory accesses: local and distant. However, it is possible
to impose a topology model on MPI, an advanced topic reserved for later.
Although MPI has an underlying distributed memory model, it can be
used for
- Distributed memory machines
- Shared memory machines
- Arrays of SMPs ("clusters")
- Networks of workstations
- Heterogeneous networks of machines
This is because the logical programming model does not have to
match the physical machine architecture. Generally you can get
more efficient code if your programming model matches the machine's
physical hardware, but the advantage of MPI is that by making only
minimal assumptions about the hardware, your code will be
portable across the wide variety of actual systems listed above.
Ancient history
In the 1980's, distributed memory machine manufacturers all developed message passing
libraries for their machines - it was the only effective way to use them. However,
each company had its own model of message passing and different bindings to standard
languages such as C and Fortran. Codes running on one machine could only be ported to
another with great effort, and often it was not just a matter of translating one
message passing function call into another since the underlying models differed.
However, some general forms emerged. Most had send-receive primitives for
point-to-point communication where one processor sends a message and another receives
it. (The other flavor of communication is collective: an entire set of
processors is involved, e.g. in a broadcast message from a processor to all other
processors).
A send primitive typically had the format
send(address, length, destination, id).
Here
- address gives the memory address for the beginning location
of the message in local memory.
- length gives the length of the message (typically in bytes).
- destination is an identifier indicating which other processor
should receive the message.
- id is a message identifier, needed in case multiple messages
get sent from one process to another.
The receive operation was similar:
receive(address, buffsize, source, id, length), where
- address gives the memory address for the beginning location
of the buffer in local memory to place the incoming message
- buffsize gives the maximum size of the buffer
- length gives the actual length of the message (typically in bytes).
- source is an identifier indicating from which other processor
the message is coming
- id is the message identifier used by the corresponding send
This approach can handle virtually all that is needed by a distributed memory
model of computing, but has shortcomings. Messages of noncontiguous data
and sophisticated data objects also need to be communicated. Partitioning
the processes into groups that work on heterogeneous parts of the code is
needed in some applications. Finally, for extremely large scale computations
you want to tie together machines with different architectures, possibly
with different internal binary representations.
The few portable message passing systems were mostly university or national lab
research projects, and were incomplete, lacked vendor support, and were inefficient
since they introduced another layer on top of the machines' "native" message passing
libraries. Out of those, the only real survivor is PVM (Parallel Virtual Machine),
partly because it was the first that tried to get large scale vendor support.
Appearance of MPI
MPI is an industry standard, prompted by recognition on the part of
parallel system purchasers that code development was not cost effective
on those machines. Typically it took three or more years to
port and validate a major engineering or scientific code - but the parallel
systems became outdated every two years.
An MPI program consists of multiple processes, each with its own
address space. Each process runs the same program (SPMD model), but
has a unique number that identifies it. If there are
p MPI processes participating in a single program, a process's
identifying number is an integer between 0 and
p-1 and is called its "rank". In the statement of most
SPMD algorithms, there are lines like
if (myid == 0) { ... }
else { ... }
which is read "if my process identification number is zero, then do the
following; otherwise, do something else". This is how MIMD programs
are built on an SPMD model - each process runs the same program, which branches
depending on the process's ID number. In MPI, that ID number is called
its rank.
Note that most distributed memory machines can be run directly
in MIMD mode with each processor actually running a completely different
program. However, SPMD is the model most often used to emulate
MIMD actions. The reasons are psychological - it is easier to
have a single source code to write and examine.
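As a concrete sketch of this branching (the MPI calls used here are all
described later in this section; the printed messages are just for
illustration), an SPMD program in C might look like:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myid, p;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);   /* my rank */
    MPI_Comm_size(MPI_COMM_WORLD, &p);      /* number of processes */

    if (myid == 0) {
        /* only the rank-0 process executes this branch */
        printf("Process 0 of %d: doing the coordinating work\n", p);
    } else {
        /* every other process executes this branch */
        printf("Process %d of %d: doing a worker's share\n", myid, p);
    }

    MPI_Finalize();
    return 0;
}

Every process runs the same executable, but the branch on myid makes
different ranks do different things.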
Each message in MPI consists of the data sent and a
header. The header contains
- The rank of the sender
- The rank of the receiver
- A message identifier number called its tag
- A communicator identification
The MPI standard guarantees that the integers 0-32767 can be used as
valid tag numbers, but most implementations allow far more.
One basic concept in MPI is that of a
communicator group: a set of MPI processes that work together on a
problem and can send messages to each other.
To start, we will use the default communicator group
MPI_COMM_WORLD, which sets a single context and involves all the
processes running. This is a predefined communicator, of type
MPI_Comm. Later we will cover more details about the concepts of
communicators and contexts. But to get an idea of why different communicator
groups may be needed in a single program, consider what happens if we are
running an MPI program that calls a math library - which was also built to
use parallelism via MPI. To keep process number 3 in our program from getting
confused with process number 3 as defined by the library, we need
an additional identifier to distinguish them (and keep one from receiving
a message intended for the other). This additional identification is the
communicator group.
Although MPI has over 120 different functions that can be invoked, all
parallel programs can be built using just six:
- MPI_INIT() initializes MPI in a program.
- MPI_COMM_SIZE() returns the number of cooperating processes.
- MPI_COMM_RANK() returns the process identifier for the process that invokes it.
- MPI_SEND() sends a message.
- MPI_RECV() receives a message.
- MPI_FINALIZE() cleans up and terminates MPI.
For our purposes, there are a few more that are useful right from the start:
- MPI_BCAST(): send a message from one processor to all the others in the
specified communicator group.
- MPI_ALLREDUCE(): perform a reduction operation, and make the
reduced scalar available to all participating processes.
The last one is useful for most dot products, since it is typically the case
that the resulting scalar is needed by all the processors.
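As a sketch of how such a dot product looks in C (the calling sequences are
covered in detail below; the slice length n_local and the dummy data are made
up for this example), each process computes a partial sum over its own piece
of the vectors, and MPI_Allreduce combines the pieces:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* hypothetical local slices of two distributed vectors */
    int n_local = 4;
    double x_local[4], y_local[4];
    double local = 0.0, global;

    for (int i = 0; i < n_local; i++) {
        x_local[i] = 1.0;                    /* dummy data */
        y_local[i] = (double) myrank;
        local += x_local[i] * y_local[i];    /* partial dot product */
    }

    /* sum the partial results and give the total to every process */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("Process %d: global dot product = %f\n", myrank, global);

    MPI_Finalize();
    return 0;
}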
For performance evaluation, we can also use:
- MPI_WTIME(): returns a double giving the number of seconds elapsed since
some fixed time in the past (for example, the beginning of the program or
1 January 1970, depending on the implementation).
- MPI_WTICK(): returns a double that gives the resolution of MPI_WTIME(), in seconds.
We will not use all of them immediately.
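As a quick sketch of how the timing calls are typically used (the timed region
here is empty; in a real program it would contain the work you want to measure):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();   /* start of timed region */
    /* ... work to be timed goes here ... */
    double t1 = MPI_Wtime();   /* end of timed region */

    printf("elapsed: %f seconds (clock resolution %g seconds)\n",
           t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}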
Here are some details about the ones
we need right away; this is for the C language versions.
- MPI_Init(&argc, &argv); This must be called at the beginning
of the parallel program, after whatever global initializations need
to be performed. Its two arguments are the addresses of the two arguments
that a C main() takes (so that they can be passed along to all the
MPI processes).
- MPI_Comm_size(MPI_Comm comm, &p); This takes the communicator group
as first argument (which will be MPI_COMM_WORLD for all our beginning
programs). It returns in the second argument the number of processes
participating in the communicator group.
- MPI_Comm_rank(MPI_Comm comm, &myrank); This takes the communicator group
as first argument and returns the rank of the calling process.
- MPI_Finalize(); Takes no arguments; it just cleans things up. Try leaving
it out of your program on the burrow to see what happens.
- MPI_Send(void* message, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm);
- MPI_Recv(void* message, int count, MPI_Datatype datatype, int src, int tag,
MPI_Comm comm, MPI_Status *status);
The last two are the most basic send/receive pair, and their required arguments are
- void* message: The beginning address of the block of memory containing the message.
- MPI_Datatype datatype: This is one of the allowed types, generally corresponding
to a C datatype. For example, these include (with C type in parentheses)
- MPI_CHAR (signed char)
- MPI_INT (signed int)
- MPI_DOUBLE (double)
- MPI_FLOAT (float)
- int count: The message consists of "count" items of the given datatype.
- int src: the source of the message (part of the MPI_Recv calling sequence)
- int dest: the destination for the message (part of the MPI_Send calling sequence)
- int tag: the tag for the message.
Note that there is no separate argument giving the "buffsize" as in the generic
send/receive functions described earlier; in MPI_Recv the count argument serves
as the maximum number of items the receiving buffer can hold. If the sent message
is too large to fit into the receiving buffer, the result is either a segmentation
fault (the best case) or weird corruption of your data.
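Putting these pieces together, here is a minimal sketch of a complete program
built from the six basic functions; the tag value and the payload are arbitrary
choices for illustration:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myrank, p, tag = 17;
    double value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        /* rank 0 sends one double to each of the other ranks */
        for (int dest = 1; dest < p; dest++) {
            value = 3.14 * dest;   /* dummy payload */
            MPI_Send(&value, 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
        }
    } else {
        /* every other rank receives one double from rank 0 */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
        printf("Process %d received %f\n", myrank, value);
    }

    MPI_Finalize();
    return 0;
}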
The final argument of the MPI_Recv function gives information about
the message as actually received. It is a C structure with at least
three fields, MPI_SOURCE, MPI_TAG, and MPI_ERROR, giving the source,
the tag, and an error code.
So if, for example, the receive used the wild-card MPI_ANY_SOURCE for its
source argument, then status->MPI_SOURCE will contain the rank of the
process that actually sent the received message.
Note that the MPI_Status variable does not necessarily have a field
for the count of data items actually received; you should use the function
MPI_Get_count() for that.
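Here is a sketch of how the status fields and MPI_Get_count() are typically
used together; the tag, the message lengths, and the buffer size are made up
for this example, and it assumes fewer than 64 processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myrank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (myrank != 0) {
        /* each worker sends "myrank" integers to rank 0 */
        int data[64];
        for (int i = 0; i < myrank; i++) data[i] = myrank;
        MPI_Send(data, myrank, MPI_INT, 0, 99, MPI_COMM_WORLD);
    } else {
        /* rank 0 receives from the workers in whatever order they arrive */
        int buffer[64], nreceived;
        MPI_Status status;
        for (int k = 1; k < p; k++) {
            MPI_Recv(buffer, 64, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &nreceived);
            printf("got %d ints from process %d (tag %d)\n",
                   nreceived, status.MPI_SOURCE, status.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}

Since status is declared here as a structure rather than passed around as a
pointer, its fields are accessed with "." instead of "->".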