Let me see if I understood you properly:
If the messages are too long, MPI is unable to "start" all the
communications at once, so a lot of them are left pending. Then, if I
call MPI_Test a few times during my calculations, the library is able
to make progress on some of the pending sends/receives.
In that case, I should measure the cost of calling MPI_Test, because
it might eat into the performance gain I expect from the nonblocking
communications.
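For example, I guess I could time it with MPI_Wtime, something like
this (just a sketch; NUM_REQS and the "reqs" array stand in for my
real request array and its size):

    double t0, t1;
    int flag;
    MPI_Status stats[NUM_REQS];

    t0 = MPI_Wtime();
    /* MPI_Testall returns immediately whether or not the pending
     * operations have completed, so t1 - t0 is the overhead of one
     * polling call. */
    MPI_Testall(NUM_REQS, reqs, &flag, stats);
    t1 = MPI_Wtime();
    printf("one MPI_Testall call took %g seconds\n", t1 - t0);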
Do you suggest calling MPI_Test a few times, spread throughout the
calculation in step 4? What would happen if I called MPI_Test just a
couple of times between steps 3 and 4? That way, I could use a loop
and dynamically adjust how many times MPI_Test is called based on the
buffer size.
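In other words, something along these lines (a rough sketch of steps
3-5; NUM_CHUNKS, do_partial_calculus() and the "reqs" array are
made-up names, not my real code):

    MPI_Request reqs[NUM_REQS];    /* filled by the 34 MPI_Isend and */
    MPI_Status  stats[NUM_REQS];   /* 34 MPI_Irecv calls in step 3   */
    int i, flag;

    /* ... step 3: post the nonblocking sends and receives ... */

    /* Step 4: do the calculus in chunks, calling MPI_Testall between
     * chunks so the library gets a chance to progress the pending
     * messages. Completed requests are set to MPI_REQUEST_NULL, so
     * calling it repeatedly is safe. */
    for (i = 0; i < NUM_CHUNKS; i++) {
        do_partial_calculus(i);   /* placeholder for my real work */
        MPI_Testall(NUM_REQS, reqs, &flag, stats);
    }

    /* Step 5: wait for whatever has not completed yet. */
    MPI_Waitall(NUM_REQS, reqs, stats);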
Where can I find deeper documentation about how MPI implements
nonblocking operations? I need to know more about it.
Thanks a lot.
> -----Original Message-----
> From: lam-bounces_at_[hidden]
> [mailto:lam-bounces_at_[hidden]] On Behalf Of Brian Barrett
> Sent: Sunday, August 24, 2003 7:24 PM
> To: General LAM/MPI mailing list
> Subject: Re: LAM: RE: Performance on MPI using nonblocking
> communications
>
>
> On Wednesday, August 13, 2003, at 10:43 AM, Pablo Milano wrote:
>
> > Brian, I don't understand the "progress behaviour" you talk
> > about. My program basically works as follows:
> >
> > 1) Initialize data
> > 2) Start calculus until reaching some condition
> > 3) Start all the nonblocking send and receive operations (34 sends
> >    and 34 receives per process).
> > 4) Continue the calculus (this step takes much more time than 2)
> > 5) Wait for all the nonblocking operations to complete
> > 6) If more calculus is needed, then go to 2
> >
> > The blocking approach is exactly the same (same code) but in 3 the
> > send and receive operations are blocking and, of course, step 5 is
> > omitted.
>
> For long messages, here's what is happening. In step 3, LAM is
> scrambling to get all the messages started - sending out the headers,
> receiving other headers, things like that. Perhaps one or two early
> messages get most of the way through, but even that's unlikely. So
> there are lots of pending messages for LAM to deal with when you
> enter step 4. While you are working on calculations, LAM can't make
> any progress on the messages - the entire app is single threaded and
> you aren't entering into LAM at any point. So then you enter step 5
> and try to complete all the messages. Since there are so many
> pending, all trying to complete at the same time, things get backed
> up and take longer than with MPI_SEND/MPI_RECV.
>
> The behavior described above is for TCP. With LAM 7.0, the Myrinet/gm
> interface is capable of some progress "in the background" (ie, while
> you are doing computation). But, of course, Myrinet isn't cheap and
> there are still instances where LAM will have to block. Commercial
> MPI implementations also generally have some way of making progress
> on non-blocking communication while the user is performing
> computation, but again, those aren't cheap options.
>
> To better hide the impact of non-blocking communication, throwing the
> occasional MPI_Test in during computation can prevent the huge spike
> of communication in step 5. LAM will try to make progress on messages
> by refilling the TCP buffers (basically - not quite that simple)
> every time you enter MPI_Test, so the communication cost should be
> better hidden.
>
>
> Hope this helps,
>
> Brian
>
> --
> Brian Barrett
> LAM/MPI developer and all around nice guy
> Have a LAM/MPI day: http://www.lam-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>