LAM/MPI General User's Mailing List Archives

From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2004-05-29 13:20:44


On May 28, 2004, at 8:26 PM, Mike Arnold wrote:

> another one for you Brian regarding how non-blocking receives and
> sends are handled for the rendezvous protocol:
>
> Consider the case where the receiving process posts a receive using
> MPI_Irecv() well _before_ the sending process calls MPI_Send() or
> MPI_Bsend().
>
> With the rendezvous protocol, does MPI_Send() have to wait until the
> receiver calls MPI_Wait(), in order to get an ack from the receiver
> and start the data transfer, and then return? Or will MPI_Irecv()
> already have sent an ack in anticipation of a future send request
> being made?

The MPI_Send() will have to wait until the process that posted the
MPI_Irecv has entered an MPI function that does communication (it could
even be another send - LAM progresses all pending communication
whenever any communication function is called).
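
Something like this sketch, for example (just illustrative - the buffer
size and the sleeps are made up, and the actual eager/rendezvous
crossover depends on the RPI and its settings):

#include <mpi.h>
#include <unistd.h>

#define N (1 << 20)   /* assumed large enough to force the rendezvous path */

int main(int argc, char **argv)
{
    static double buf[N];
    int rank;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* receive posted long before the matching send */
        MPI_Irecv(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        sleep(10);      /* no MPI calls here, so no CTS goes out and      */
                        /* rank 1 sits inside its MPI_Send the whole time */
        MPI_Wait(&req, &status);   /* the CTS is sent from in here        */
    } else if (rank == 1) {
        sleep(1);       /* make sure the Irecv really is posted first     */
        MPI_Send(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}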

> i.e., for a rendezvous protocol, can the different stages in the
> receive (such as the ack) be progressed by either MPI_Irecv or
> MPI_Wait depending on whether there is a send request pending? Does
> MPI_Irecv always post an ack regardless (i.e. a "ready" rather than
> an "ack")? And can the different stages only be progressed when the
> receiving process is in one of these two calls?

No - the ack is a CTS, meaning that the receiving side has somewhere to
put the incoming message and is ready to receive it (and deal with
truncation errors and the like). I suppose you could implement a system
that pre-sent the ACKs for Irecvs that had a specified source, but the
structure of our progression engine does not permit this.
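
If you need the CTS to go out before the receiver reaches MPI_Wait, the
usual trick with a single-threaded MPI is to poke the library
periodically, e.g. with MPI_Test. A hypothetical sketch (the helper and
do_some_work are placeholders, not anything in LAM):

#include <mpi.h>

/* Hypothetical helper: receive a rendezvous-sized message while still
 * getting computation done.  Each MPI_Test call lets the progress
 * engine run, so the CTS for the pending Irecv can go out before the
 * final completion call. */
void recv_while_working(double *buf, int count, int src,
                        void (*do_some_work)(void))
{
    MPI_Request req;
    MPI_Status  status;
    int done = 0;

    MPI_Irecv(buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req);

    while (!done) {
        do_some_work();                  /* a slice of useful computation */
        MPI_Test(&req, &done, &status);  /* drives pending communication  */
    }
}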

> Similarly, if the sender uses MPI_Isend() instead of MPI_Send(), and
> there is already an ack waiting there from the receiver from a prior
> MPI_Irecv, does MPI_Isend progress to the next stage and start the
> data transfer, effectively completing the communication from the
> sender's perspective?

Since we don't pre-post ACKs, it doesn't quite work like that. We still
need the MPI_Irecv to say "yeah, I've got room for that message" before
we start shoving the data around.

> With an eager protocol is the send operation purely local?

With the eager protocol, sends (with the exception of MPI_Ssend) are
basically local. They hand the message to the kernel's TCP stack and
forget about it; the kernel deals with the rest.
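
For example (a sketch - the 64-byte payload is assumed to be well under
the eager limit):

#include <mpi.h>
#include <string.h>

/* Hypothetical sketch: a short message goes out eagerly, so this
 * MPI_Send returns as soon as the data is handed to the kernel's TCP
 * stack, whether or not the matching receive has been posted yet.
 * (Don't rely on this portably: the standard allows a standard-mode
 * send of any size to block until a receive is posted.) */
void send_small(int dest)
{
    char msg[64];

    strcpy(msg, "short eager message");
    MPI_Send(msg, (int) sizeof(msg), MPI_CHAR, dest, 0, MPI_COMM_WORLD);
    /* msg can be reused immediately; the kernel deals with the rest */
}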

I should point out that the discussion above is for the TCP transport.
The details for the Myrinet/gm and upcoming InfiniBand transports are a
little different, but the general concepts are the same. The shared
memory transport requires interaction from both parties for pretty much
every send, but if you have enough shared memory available, the amount
of lockstepping isn't too horrible.

Hope this helps,

Brian

> Brian W. Barrett wrote:
>
>> On Mon, 24 May 2004 mikea_at_[hidden] wrote:
>>> would somebody please be able to clarify for me the finer
>>> distinctions between the different send and receive calls (buffered
>>> and standard, blocking and non-blocking) in terms of the time taken
>>> by the calling process. That is, I'm not concerned about the time
>>> taken to complete the entire communication, but only the time that
>>> the calling process must spend inside the MPI calls. I'm aware of
>>> what the MPI standard states, but it's not clear to me how it is
>>> interpreted within lam-mpi given that it is single-threaded and
>>> presumably does not have internal "progress" threads. My
>>> understanding of the lam documentation is that the communication
>>> only progresses when the calling process is inside an MPI call,
>>> hence some of the advantages of non-blocking and buffered
>>> disappear. Specifically I'm guessing that:
>>>
>>> 1. Buffering should ensure that send() can return as soon as the
>>> message is buffered, but with lam I presume that a buffered send()
>>> should take at least as long as a non-buffered send() to return, as
>>> in both cases the call can not return until the sending process has
>>> fully completed with the message.
>> So this is a really hard question to answer, but just some general
>> notes... MPI_Send and MPI_Bsend both use an eager protocol for
>> sending short messages. Above some message size, a rendezvous
>> protocol is used that requires an ACK from the receiving side before
>> sending the actual data. MPI_Ssend always uses the rendezvous
>> protocol, and MPI_Isend does something similar to MPI_Send and
>> MPI_Bsend.
>> Note that just because a send has returned does *NOT* mean the
>> message has been sent. For MPI_Send and MPI_Bsend, a return means
>> that the supplied buffer can be reused by the user. For MPI_Ssend, a
>> return means that the receiving side has *started* to receive the
>> message. For MPI_Isend, all that a return means is that the send has
>> started.
>>> 2. Similarly for lam, the combined time spent in a non-blocking
>>> send() and its matching wait() will be at least that spent in a
>>> blocking send(), regardless of the time interval between isend()
>>> and wait(), unless lam uses "progress" threads. The same applies
>>> for receive().
>> Not necessarily. It could be more, the same, or less. Which isn't
>> really a useful answer, but let's look at the TCP case:
>> same: this is fairly trivial - you guessed how this would work...
>> less: remember that the OS does some buffering for us. Let's say
>> that we are doing a rendezvous protocol send. During the Isend call,
>> we send the RTS and return. During the Wait, we grab the CTS that is
>> already waiting for us and start sending data. We manage to skip an
>> entire round trip of the RTS/CTS protocol. There are also some
>> buffering cases that make Isends faster as well...
>> more: if you have lots of Isends posted, you can actually slow
>> things down in the bookkeeping as LAM tries to progress all the
>> messages. This is especially true on the shared memory protocols, if
>> I recall correctly.
>>> 3. Using the lamd rpi module, will a standard blocking send()
>>> return once the message has been passed to the local daemon, or
>>> only when a matching receive has also been posted? I'm guessing the
>>> former, and that the above points also apply to lamd. I can see an
>>> advantage in using lamd, as communication with a process on the
>>> local node is cheaper than with one on a remote node, but will
>>> there be a difference whether the send is blocking or non-blocking,
>>> buffered or standard?
>> Remember that the standard talks about when it is safe to reuse the
>> buffer more than anything else. Since you can reuse the buffer as
>> soon as the message is sent to the local lamd, the apparent latency
>> for medium sized messages is often lower than with the TCP rpi.
>> However, the realized bandwidth will be much lower, so it isn't
>> necessarily going to be a win.
>> Hope this helps...
>> Brian
>
> --
> -----------------------------------------------------------------------
> Computational Neurobiology Laboratory : Phone: 858 453-4100x1455
> The Salk Institute of Biological Studies : Fax: 858 587-0417
> 10010 N. Torrey Pines Rd, La Jolla, CA 92037 : Email: mikea_at_[hidden]
> : www.cnl.salk.edu/~mikea

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/