
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-05-12 07:07:27


On May 12, 2005, at 6:37 AM, Stephan Mertens wrote:

>> Second: even if you do use BSEND, it should trigger at least some
>> progress every time you invoke BSEND (or almost any other MPI
>> communication function).
>
> No, MPI_Bsend never triggers any progress on pending messages if
> each individual message is larger than the short protocol!

I don't think that this is right, at least not in every case. The
progress logic for each RPI is different, but the source code for
MPI_BSEND in LAM quite definitely always invokes the progress engine.

What version of LAM are you using? I don't think that we've changed
this logic in a long, long time...?

> This leads to the bizarre situation that a sender that is
> *exclusively* using MPI_Bsend will never get a message through before
> it calls MPI_Finalize. Before that it will die from buffer
> exhaustion, of course. Run the program below and you will see...

See my results below (note that there is a minor bug in this program).

> [snipped]
> if (myrank == 0) {
> printf ("Buffer size is %ld\n", bufsize*sizeof(char));
> mpibuf = (char*)malloc (3*bufsize*sizeof(char));
> /* let MPI buffer at least 3 messages */
> MPI_Buffer_attach (mpibuf, 4*bufsize*sizeof(char));

Your malloc size and attach size should be the same. As shown here,
you'll likely generate a seg fault (or some other memory badness)
because you allocate less memory than you attach. I'm guessing,
however, that this is just a transcription error in the e-mail...?
See the sketch below for one way to keep the two sizes consistent.

> [snipped rest of source]
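
By the way, a minimal sketch of keeping those two sizes consistent
might look like this (the sizes and message count are just
illustrative, not taken from your program; note that
MPI_BSEND_OVERHEAD should be added per buffered message):

#include <stdlib.h>
#include <mpi.h>

/* Illustrative sizes -- not from the original program */
#define MSG_SIZE 100000
#define NUM_MSGS 3

int main(int argc, char *argv[])
{
    char *mpibuf;
    int bufsize;

    MPI_Init(&argc, &argv);

    /* Room for NUM_MSGS buffered messages, plus the per-message
       bookkeeping overhead required by the MPI standard. */
    bufsize = NUM_MSGS * (MSG_SIZE + MPI_BSEND_OVERHEAD);
    mpibuf = (char *) malloc(bufsize);

    /* Attach exactly what was allocated. */
    MPI_Buffer_attach(mpibuf, bufsize);

    /* ... MPI_Bsend() calls would go here ... */

    /* Detach blocks until all buffered messages are delivered. */
    MPI_Buffer_detach(&mpibuf, &bufsize);
    free(mpibuf);

    MPI_Finalize();
    return 0;
}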

I ran this app a few different ways (although I changed the sleep to 1
second because I'm impatient ;-) ). This is with the latest SVN
checkout of LAM/MPI:

# --> Running across 2 hosts, using the tcp RPI
[6:53] eddie:~/mpi % mpirun -np 2 -ssi rpi tcp `pwd`/bsend
Buffer size is 100000
sending message 1 at Thu May 12 06:53:38 2005
sending message 2 at Thu May 12 06:53:39 2005
****** message 1 received at Thu May 12 06:53:39 2005
sending message 3 at Thu May 12 06:53:40 2005
****** message 2 received at Thu May 12 06:53:40 2005
sending message 4 at Thu May 12 06:53:41 2005
****** message 3 received at Thu May 12 06:53:41 2005
sending message 5 at Thu May 12 06:53:42 2005
****** message 4 received at Thu May 12 06:53:42 2005
sending message 6 at Thu May 12 06:53:43 2005
****** message 5 received at Thu May 12 06:53:43 2005
sending message 7 at Thu May 12 06:53:44 2005
****** message 6 received at Thu May 12 06:53:44 2005
sending message 8 at Thu May 12 06:53:45 2005
****** message 7 received at Thu May 12 06:53:45 2005
sending message 9 at Thu May 12 06:53:46 2005
****** message 8 received at Thu May 12 06:53:46 2005
sending message 10 at Thu May 12 06:53:47 2005
****** message 9 received at Thu May 12 06:53:47 2005
****** message 10 received at Thu May 12 06:53:48 2005

# --> Running on one host, using the usysv RPI
[6:54] eddie:~/mpi % mpirun n0 n0 -ssi rpi usysv `pwd`/bsend
Buffer size is 100000
sending message 1 at Thu May 12 06:54:04 2005
sending message 2 at Thu May 12 06:54:05 2005
****** message 1 received at Thu May 12 06:54:05 2005
sending message 3 at Thu May 12 06:54:06 2005
****** message 2 received at Thu May 12 06:54:06 2005
sending message 4 at Thu May 12 06:54:07 2005
sending message 5 at Thu May 12 06:54:08 2005
****** message 3 received at Thu May 12 06:54:08 2005
sending message 6 at Thu May 12 06:54:09 2005
****** message 4 received at Thu May 12 06:54:09 2005
sending message 7 at Thu May 12 06:54:10 2005
****** message 5 received at Thu May 12 06:54:10 2005
sending message 8 at Thu May 12 06:54:11 2005
sending message 9 at Thu May 12 06:54:12 2005
****** message 6 received at Thu May 12 06:54:12 2005
sending message 10 at Thu May 12 06:54:13 2005
****** message 7 received at Thu May 12 06:54:14 2005
****** message 8 received at Thu May 12 06:54:14 2005
****** message 9 received at Thu May 12 06:54:14 2005
****** message 10 received at Thu May 12 06:54:14 2005

# --> Running on one host, using the sysv RPI
[6:54] eddie:~/mpi % mpirun n0 n0 -ssi rpi sysv `pwd`/bsend
Buffer size is 100000
sending message 1 at Thu May 12 06:54:19 2005
sending message 2 at Thu May 12 06:54:20 2005
sending message 3 at Thu May 12 06:54:21 2005
****** message 1 received at Thu May 12 06:54:21 2005
sending message 4 at Thu May 12 06:54:22 2005
sending message 5 at Thu May 12 06:54:23 2005
****** message 2 received at Thu May 12 06:54:23 2005
sending message 6 at Thu May 12 06:54:24 2005
sending message 7 at Thu May 12 06:54:25 2005
****** message 3 received at Thu May 12 06:54:25 2005
sending message 8 at Thu May 12 06:54:26 2005
sending message 9 at Thu May 12 06:54:27 2005
****** message 4 received at Thu May 12 06:54:27 2005
sending message 10 at Thu May 12 06:54:28 2005
****** message 5 received at Thu May 12 06:54:29 2005
****** message 6 received at Thu May 12 06:54:29 2005
****** message 7 received at Thu May 12 06:54:29 2005
****** message 8 received at Thu May 12 06:54:29 2005
****** message 9 received at Thu May 12 06:54:29 2005
****** message 10 received at Thu May 12 06:54:29 2005

So you can see that the messages are always delivered. tcp and usysv
do reasonably well; sysv seemed to deliver a little more slowly
(probably due to race conditions in the shared memory flow control).
Those same race conditions sometimes mean that fewer messages are
delivered before completion.

I'm curious as to why you aren't seeing this behavior. Hmm.

Also, note the LAM man page for MPI_BSEND:

-----
    In C, you can force the messages to be delivered by
    MPI_Buffer_detach(&b, &n);
    MPI_Buffer_attach(b, n);

    (The MPI_Buffer_detach will not complete until all buffered
    messages are delivered.)

    It is generally a bad idea to use the MPI_Bsend function, as it
    guarantees that the entire message will suffer the overhead of an
    additional memory copy. For large messages, or when shared memory
    message transports are being used, this overhead can be quite
    expensive.
-----
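
Just to make that concrete, a flush in the middle of a sending loop
might look something like this (the function and variable names are
mine, purely for illustration):

#include <mpi.h>

/* Illustrative only: send 'count' messages of 'len' bytes with
   MPI_Bsend, cycling the attached buffer every 'interval' sends.
   MPI_Buffer_detach blocks until everything buffered so far has
   actually been delivered. */
static void bsend_with_flush(char *data, int len, int count,
                             int interval, int dest)
{
    int i;

    for (i = 0; i < count; ++i) {
        MPI_Bsend(data, len, MPI_CHAR, dest, 0, MPI_COMM_WORLD);

        if ((i + 1) % interval == 0) {
            char *b;
            int n;
            MPI_Buffer_detach(&b, &n);   /* drains buffered messages */
            MPI_Buffer_attach(b, n);     /* re-attach the same buffer */
        }
    }
}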

So I would suggest to you again that you probably should not be using
BSEND unless you have a really good reason for it. Normally, you can
allocate a small number of buffers yourself with a freelist and use
non-blocking sends to effect the same behavior, but without the
additional latency of forcing MPI to perform an extra copy.
Admittedly, it's a little more work on your part, but it avoids all
these kinds of problems.
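
For example, something along these lines (a rough sketch with made-up
names and sizes, not a drop-in replacement for your program):

#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define POOL_SIZE 4        /* illustrative: max sends in flight */
#define MSG_SIZE  100000   /* illustrative message size */

/* Send 'count' messages with MPI_Isend using a small pool of
   preallocated buffers.  When the pool is exhausted, wait for one
   outstanding send to finish and reuse its buffer. */
static void isend_with_pool(const char *data, int count, int dest)
{
    char *bufs[POOL_SIZE];
    MPI_Request reqs[POOL_SIZE];
    int i, slot;

    for (i = 0; i < POOL_SIZE; ++i) {
        bufs[i] = (char *) malloc(MSG_SIZE);
        reqs[i] = MPI_REQUEST_NULL;
    }

    for (i = 0; i < count; ++i) {
        if (i < POOL_SIZE) {
            slot = i;
        } else {
            /* Block until some earlier send completes, then recycle
               its buffer. */
            MPI_Waitany(POOL_SIZE, reqs, &slot, MPI_STATUS_IGNORE);
        }

        /* In a real code you would often generate the data directly
           into the pool buffer and skip this copy entirely. */
        memcpy(bufs[slot], data, MSG_SIZE);
        MPI_Isend(bufs[slot], MSG_SIZE, MPI_CHAR, dest, 0,
                  MPI_COMM_WORLD, &reqs[slot]);
    }

    /* Drain whatever is still in flight before freeing the buffers. */
    MPI_Waitall(POOL_SIZE, reqs, MPI_STATUSES_IGNORE);
    for (i = 0; i < POOL_SIZE; ++i) {
        free(bufs[i]);
    }
}

The pool size bounds how much memory you commit up front, and the
outstanding sends make progress each time you re-enter the MPI
library.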

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/