I/O and Collective Communications with MPI-1
A problem with using MPI-1 is I/O. Suppose in the round-robin
communication you want to input the message length and a value
to use to fill in the message. So you put in a line
scanf("%d %f", &size, &val);
and then the MPI program is run with three processors and the
user inputs the line
24 3.1415
Who gets what? Does each process get size and val, or does just
process 0 get them? It is possible for process 1 to get the
"24" while process 0 gets the "3.1415" and process 2 hangs, waiting
in vain for input.
Because of these problems, always have just process 0 do I/O, and
relay the results on to the other processes. In MPI, it is guaranteed
that the process with rank 0 can do I/O. In practice vendors usually
try to supply better I/O mechanisms, but parallel I/O is a newer part
of the MPI standard (MPI-2).
The algorithm to have just process 0 handle input would be something like
if (myrank == 0)
    read in integer i, double d
    for k = 1 to p-1
        MPI_Send message with i to process of rank k
        MPI_Send message with d to process of rank k
    end for
else
    MPI_Recv message with i from process of rank 0
    MPI_Recv message with d from process of rank 0
end if
This is not scalable. Part of the problem is that
one processor does all the reading of data, but that
is all we can count on being able to do with MPI. A larger scalability
problem is that one processor
sends all the messages - and this we can improve on.
The trick is to use a tree code to send the messages out. Suppose
that the number p of processes is a power of 2. Then use the algorithm
k = p/2
not_received = true
for i = 1 to log2(p)
    if (myrank mod 2*k == 0) then
        MPI_Send data to myrank + k
    else if (not_received and myrank mod 2*k == k) then
        MPI_Recv data
        not_received = false
    end if
    k = k/2
end for
This is the classical binary tree for data transfers, with
time moving downwards:
0
/ \
/ \
/ \
/ \
0 4
/ \ / \
/ \ / \
0 2 4 6
/ \ / \ / \ / \
0 1 2 3 4 5 6 7
Of course, no message needs to be sent along the left branches
at each interior node.
Just to test your understanding, in the algorithm given above,
what rank number should the process issuing the
MPI_Recv() give as the "source" field?
Collective Communications
Because it is often the case that one process needs to send or receive
data to or from all other processes, MPI provides
collective communication functions. If you are lucky, the
vendor has optimized them for the machine topology; in MPICH they use
a tree algorithm like the one above. Here are some of the functions:
- int MPI_Bcast(void *msg, int count, MPI_Datatype datatype, int root,
MPI_Comm comm):
sends a message from process root to all others in the communicator group.
This function must be called by all participating processes.
Also, count and datatype must match on all processes, unlike with MPI_Send
and MPI_Recv.
- int MPI_Reduce(void *operand, void *result, int count, MPI_Datatype datatype,
MPI_Op op, int root, MPI_Comm comm): combines the values stored in operand
and leaves the answer in result on the process with rank "root". Here count,
datatype, and the operation op must be the same on all processes. The available
operations include MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, and various logical
operations.
Warning: although only root gets the answer, all the participating
processes must pass a valid result argument, and it must point to actual
storage on every process.
- Often you want a global reduction operation with the result left on
every process, not just a root one. Instead of following MPI_Reduce
with an MPI_Bcast, use MPI_Allreduce(). In general, the word "All"
embedded in an MPI function name means that the operation's result
ends up in all tasks in the communicator group.
Other collective communications that can be useful are MPI_Gather, MPI_Scatter,
MPI_Allgather, and MPI_Allgatherv. But the most important one for debugging is
MPI_Barrier(MPI_Comm comm). Its only argument is the communicator group, and
it blocks processes until they have all called it. This is a synchronization
primitive. The most common mistake in parallel computing is to implicitly assume
a synchronization that does not in fact exist - except in your mind. You should
follow each communication phase in your program with an MPI_Barrier until the
code is running correctly. Then one-by-one remove the barriers that you think
are unnecessary.
Note about "Blocking"
Until now we have used blocking send and receive. Here is what blocking means:
- Blocking send: the program will not return from the call to MPI_Send
until the send buffer can be re-used, i.e., until the sending process can
safely overwrite the message buffer.
- Blocking receive: the receiving process will not return from a call to
MPI_Recv until the receive buffer actually contains the incoming message.
There is almost no synchronization implied by this. Typically, an outgoing message
is buffered by the system, so the sending process can return from MPI_Send before
the receiving process has started receiving the message, or has even
posted a corresponding MPI_Recv. Beware of this.
Equally weird, a receive can complete before the matching send completes.
See if you can figure out how that could happen!
Although it seems that the above is haphazard, there is one property that
MPI imposes on messages: they will be non-overtaking, that is,
if process 0 sends out three messages a, b, and c in that order and with
identical tags to process 1,
and process 1 posts three receives, then the order the messages will be
received is a, b, c. However, if more than two processors are involved,
and process 2 also sends messages d, e, and f to process 1, and if process 1
has specified a wildcard src = MPI_ANY_SOURCE, then messages from process 0
can be mingled in any order with those from process 2. So, for example,
orders in which process 1 could receive messages are
- a, b, d, e, f, c
- d, a, e, b, c, f
- a, d, e, b, c, f
but it could not receive the messages in an order such as
- b, a, d, e, f, c
since b cannot overtake a on the way from process 0 to process 1.
- Last Modified: Thu 08 Feb 2018, 07:17 AM