On Dec 9, 2004, at 4:03 PM, Brad Penoff wrote:
> FYI, the code most definitely works if Wait(B) is thrown in there. I
> think we were more concerned about the "what if's"...
Good. Whew. :-)
> I'm definitely still confused when I read the standard regarding these
> semantics then; I had thought Wait(C) implied Wait(B), especially
> since they share a tag, communicator/context, and due to the order in
> which they are initiated.
Nope. But this is an excellent question. :-)
In short, here's what the deal is:
- MPI will match messages between two processes on the same
communicator with the same tag in the order in which they were posted.
- For a non-blocking communication, you can never assume that the
contents of the buffer are filled before you call Test/Wait.
So what is happening under the covers is that for a long message, LAM
is sending an envelope right away. This envelope is received and is
what is matched in order (hence, the envelope for B is matched
appropriately with the Irecv for B). Later, the sender sends a second
envelope, this one for C, which is likewise matched (with the Irecv for C). Since LAM is single
threaded, the transfer doesn't actually occur until you dip down into
the progression engine. More specifically, the transfer won't occur
for a long message until you Test/Wait on the specific request.
This is in sharp contrast to short messages, which are sent eagerly
(i.e., envelope plus payload). So those can arrive (and
potentially be in the destination buffer) before you call Test/Wait
(this is also influenced by OS-level buffering -- sending lots and lots
of short messages eagerly may end up blocking because the OS buffering
is full, for example). But keep in mind that this "early delivery"
behavior is a side effect of the implementation -- you can't assume it.
MPI says that you must call Test/Wait before looking at the receive
buffer.
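
As a rough illustration of the short/long split described above, here is a toy Python sketch (not LAM's actual code). The function name, the frame labels, and the choice to treat a message of exactly the limit as short are all assumptions made for the example; only the 64k figure comes from the TCP default mentioned in this message.

```python
# Toy model only: names and frame labels are hypothetical, not LAM internals.
EAGER_LIMIT = 64 * 1024  # assumed short/long boundary in bytes


def initial_wire_traffic(name, nbytes, eager_limit=EAGER_LIMIT):
    """Return what the sender puts on the wire right away for one Isend."""
    if nbytes <= eager_limit:
        # Short message: sent eagerly, envelope followed by the payload.
        return [("envelope", name), ("payload", name)]
    # Long message: only the envelope goes out; the payload follows later,
    # once the receiver has matched the envelope and sent back an ACK.
    return [("envelope", name)]
```

For example, `initial_wire_traffic("B", 1 << 20)` yields only the envelope, which is why B's payload cannot be in the receive buffer before the receiver makes progress on it.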
Let me put this in concrete terms (assuming TCP, and a default
short/long size of 64k):
Time  Event
0     Isend(A), short: sends envelope plus payload
1     Isend(B), long: sends envelope only
2     Isend(C), short: sends envelope plus payload
3     Isend(D), short: sends envelope plus payload
4     Isend(E), short: sends envelope plus payload
      --> for the sake of simplicity, assume that the receiver starts
          at t=5 -- all the above is "on the wire" and waiting to be
          received
5     Irecv(A), expecting a short: receive envelope, matches, receive
      message into target buffer, return (i.e., this all happened in
      the call to Irecv)
6     Irecv(B), expecting a long: receive envelope, matches, sends
      back an ACK
7     Irecv(C), expecting a short: receive envelope, matches, receive
      message into target buffer, return (i.e., this all happened in
      the call to Irecv)
      sender receives ACK for B, sends 2nd envelope for B followed by
      B's payload
8     Irecv(D), expecting a short: receive envelope, matches, receive
      message into target buffer, return (i.e., this all happened in
      the call to Irecv)
      --> it's a race condition here as to whether the 2nd envelope
          for B has arrived yet and will be seen in the single call to
          Irecv -- let's assume it hasn't arrived yet
9     Wait(C): this message is already in the target buffer; LAM will
      simply mark the request as complete -- there's no need to even
      dip into the progression engine; we can quickly return because
      the request of interest was complete
Hence, although the sender may have actually sent the message (B), we
haven't gone into the progression engine to receive it.
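
The timeline can be replayed with a small toy model in Python. This is only a sketch under simplifying assumptions, not how LAM is implemented: the class, method, and frame names ("eager", "rts", "data") are invented for the example, Isend(E) is left out because the timeline never posts a receive for it, and the sender's reaction to the ACK is modeled as a single step after all the Irecvs.

```python
from collections import deque


class ToyReceiver:
    """Single-threaded receiver: matches envelopes in order, defers long payloads."""

    def __init__(self, wire):
        self.wire = deque(wire)   # frames from the sender, in arrival order
        self.complete = {}        # request name -> is the payload in the buffer?
        self.acked = []           # long messages that have been ACKed
        self.progress_calls = 0   # times we dipped into the "progression engine"

    def irecv(self, name):
        kind, sent_name = self.wire.popleft()  # next envelope on the wire
        assert sent_name == name, "envelopes are matched in posting order"
        if kind == "eager":
            self.complete[name] = True         # payload came with the envelope
        else:                                  # "rts": long message, envelope only
            self.acked.append(name)
            self.complete[name] = False

    def sender_delivers_long(self):
        # The sender sees the ACKs and puts 2nd envelope + payload on the wire.
        for name in self.acked:
            self.wire.append(("data", name))
        self.acked.clear()

    def wait(self, name):
        if self.complete[name]:
            return                             # already complete: no progress made
        self.progress_calls += 1
        leftover = deque()
        while self.wire:
            kind, sent_name = self.wire.popleft()
            if kind == "data":
                self.complete[sent_name] = True
            else:
                leftover.append((kind, sent_name))
        self.wire = leftover
        if not self.complete[name]:
            raise RuntimeError("toy model: payload never arrived")


def run_timeline():
    rx = ToyReceiver([("eager", "A"), ("rts", "B"), ("eager", "C"), ("eager", "D")])
    for name in "ABCD":
        rx.irecv(name)
    rx.sender_delivers_long()
    rx.wait("C")
    b_done_after_wait_c = rx.complete["B"]
    progress_after_wait_c = rx.progress_calls
    rx.wait("B")
    return b_done_after_wait_c, progress_after_wait_c, rx.complete["B"], rx.progress_calls
```

In this model `run_timeline()` returns `(False, 0, True, 1)`: after Wait(C), B is still incomplete and no progress call was made, and only Wait(B) pulls B's payload off the wire.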
So think of it this way: MPI guarantees the *matching* in order, even
if the physical delivery isn't in order. How the implementation
actually performs the physical delivery is not specified (and thank
goodness! :-). That, plus the fact that MPI says you are not allowed
to look in buffers from non-blocking receives before calling Test/Wait.
> As a slight variation of the pseudo-code, say the compute statement on
> n0 instead read "compute(A+C)", and the rest of the code remained the
> same i.e. no additional Wait statements than those in the original.
> In our execution, A and C have arrived, but B has not.
Correct. But it's still an invalid MPI program. ;-)
More specifically: it's happening that way because of the way that LAM
is implemented. With a different MPI (perhaps one with a smaller
amount of buffering -- or one of LAM's other RPI modules!), A and/or C
could be a long message, and you wouldn't get these [lucky] semantics.
Heck, it's even legal to defer *all* physical delivery until Test/Wait
(i.e., not even send any envelopes eagerly).
> In this case, does performing "compute(A+C)" (successfully even)
> break MPI semantics since B isn't present yet (and C overtook B)?
No.
The whole issue is that C did not overtake B in terms of MPI semantics.
It did in terms of physical delivery, but that's not the issue here.
Put differently: it doesn't matter what order you call the Test/Wait's
in. It matters in what order you post the sends / receives.
LAM is correct because no matter what order you call the Wait's in, the
Irecv(A) will always receive the message sent by Isend(A), the Irecv(B)
will always receive the message sent by Isend(B), ...etc.
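
The same point as a deliberately trivial Python sketch (a toy model with made-up function names, assuming all messages go between one pair of processes with the same tag and communicator): matching is fixed when the sends and receives are posted, so the order of the Waits only changes when each buffer gets filled, not what ends up in it.

```python
from itertools import permutations


def match_in_posting_order(sends, recvs):
    """Pair the i-th posted receive with the i-th posted send."""
    return dict(zip(recvs, sends))


def buffers_after_waits(sends, recvs, wait_order):
    """Fill each receive buffer when its request is waited on (toy model)."""
    matched = match_in_posting_order(sends, recvs)
    buffers = {}
    for request in wait_order:  # Wait order decides *when*, never *which message*
        buffers[request] = matched[request]
    return buffers


sends = ["msg A", "msg B", "msg C"]
recvs = ["Irecv(A)", "Irecv(B)", "Irecv(C)"]
results = [buffers_after_waits(sends, recvs, order) for order in permutations(recvs)]
```

Every entry in `results` is the same mapping, with `Irecv(B)` holding `"msg B"` regardless of which Wait came first.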
Make sense?
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/