Ok, I think I see what is happening here.
BSEND *does* actually make progress when you call it, but not
necessarily enough to finish an entire message. This is not
necessarily related to short vs. long -- it's just a question of how
much progress you want your MPI to make before returning "immediately".
LAM takes the approach (in the TCP RPI) that it will try to select()
on sockets that it's trying to write on and if they're available for
writing, it'll take one swipe at it with the current message that is
being sent by setting the socket to non-blocking mode and calling
write() or writev().
You can see this effect by increasing your message size from 100.
LAM's TCP short message size defaults to 64k, so messages of 100,
1,000, and 10,000 bytes all go pretty much right away. When you go to
100,000, you can see that the first message is sent after the second
BSEND -- which makes sense; the receiver has had time to ACK the
rendezvous protocol, and the sender now sends it. This timing will
probably stay more-or-less constant, especially with a high-latency
transport like TCP.
However, when you increase to 1,000,000 -- it just takes longer to send
the full message, particularly when the default short message size is
64k, because LAM then sets up the kernel socket buffering at 64k, and
the kernel effectively fragments internally and only sends part of the
~1M message at a time. Adding a few printf's down in the progression
engine, I can see that progress *is* occurring, just very slowly. The
DETACH effectively is a "wait" on buffered sends, so they all get sent
during that time.
Here's the output from a typical run (with printf's down in the sending
progress engine):
-----
[12:17] eddie:~/mpi % mpirun n0,1 -ssi rpi tcp bsend -b 1000000
Buffer size is 1000000
sending message 1 at Thu May 12 12:17:46 2005
sending message 2 at Thu May 12 12:17:47 2005
Sent 104232 message bytes
sending message 3 at Thu May 12 12:17:48 2005
Sent 105704 message bytes
sending message 4 at Thu May 12 12:17:49 2005
Sent 105704 message bytes
sending message 5 at Thu May 12 12:17:50 2005
Sent 105704 message bytes
sending message 6 at Thu May 12 12:17:51 2005
Sent 105704 message bytes
sending message 7 at Thu May 12 12:17:52 2005
Sent 105704 message bytes
sending message 8 at Thu May 12 12:17:53 2005
Sent 105704 message bytes
sending message 9 at Thu May 12 12:17:54 2005
Sent 105704 message bytes
sending message 10 at Thu May 12 12:17:55 2005
Sent 105704 message bytes
sender detaches buffer at Thu May 12 12:17:59 2005
Sent 50136 message bytes
****** message 1 received at Thu May 12 12:17:59 2005
Sent 105680 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 28416 message bytes
****** message 2 received at Thu May 12 12:17:59 2005
Sent 105680 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 28416 message bytes
****** message 3 received at Thu May 12 12:17:59 2005
Sent 105680 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 28416 message bytes
****** message 4 received at Thu May 12 12:17:59 2005
Sent 105680 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 28416 message bytes
****** message 5 received at Thu May 12 12:18:00 2005
Sent 105680 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 28416 message bytes
****** message 6 received at Thu May 12 12:18:00 2005
Sent 105680 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 28416 message bytes
****** message 7 received at Thu May 12 12:18:00 2005
Sent 105680 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 28416 message bytes
****** message 8 received at Thu May 12 12:18:00 2005
Sent 105680 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 37648 message bytes
Sent 28416 message bytes
****** message 9 received at Thu May 12 12:18:00 2005
Sent 1000000 message bytes
****** message 10 received at Thu May 12 12:18:00 2005
-----
You can see that the OS chooses to buffer things a little oddly for the
last several messages (LAM always tries to send the remaining message
-- it doesn't fragment). Gotta love Linux. :-)
Note that increasing the short message size (and therefore also
increasing the OS socket buffering), while still forcing a long message
(i.e., forcing LAM to use a rendezvous protocol), gives at least
*somewhat* better performance -- more of the message gets sent out during
each call to BSEND:
-----
[12:20] eddie:~/mpi % mpirun n0,1 -ssi rpi tcp -ssi rpi_tcp_short 1000000 bsend -b 1000001
Buffer size is 1000001
sending message 1 at Thu May 12 12:20:40 2005
sending message 2 at Thu May 12 12:20:41 2005
Sent 211384 message bytes
sending message 3 at Thu May 12 12:20:42 2005
Sent 211408 message bytes
sending message 4 at Thu May 12 12:20:43 2005
Sent 211408 message bytes
sending message 5 at Thu May 12 12:20:44 2005
Sent 211408 message bytes
sending message 6 at Thu May 12 12:20:45 2005
Sent 154393 message bytes
****** message 1 received at Thu May 12 12:20:46 2005
sending message 7 at Thu May 12 12:20:46 2005
Sent 211384 message bytes
sending message 8 at Thu May 12 12:20:47 2005
Sent 211408 message bytes
sending message 9 at Thu May 12 12:20:49 2005
Sent 211408 message bytes
sending message 10 at Thu May 12 12:20:50 2005
Sent 211408 message bytes
sender detaches buffer at Thu May 12 12:20:54 2005
Sent 154393 message bytes
****** message 2 received at Thu May 12 12:20:54 2005
Sent 212832 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 63169 message bytes
****** message 3 received at Thu May 12 12:20:54 2005
Sent 212832 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 63169 message bytes
****** message 4 received at Thu May 12 12:20:54 2005
Sent 212832 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 63169 message bytes
****** message 5 received at Thu May 12 12:20:54 2005
Sent 212832 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 63169 message bytes
****** message 6 received at Thu May 12 12:20:54 2005
Sent 212832 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 63169 message bytes
****** message 7 received at Thu May 12 12:20:54 2005
Sent 212832 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 63169 message bytes
****** message 8 received at Thu May 12 12:20:54 2005
Sent 212832 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 72400 message bytes
Sent 63169 message bytes
****** message 9 received at Thu May 12 12:20:54 2005
Sent 1000001 message bytes
****** message 10 received at Thu May 12 12:20:54 2005
-----
Keep in mind that this is for TCP. With Myrinet/GM, for example, your
performance characteristics would likely be a bit better because there
are no partial sends in GM (i.e., the Myri NIC will send the entire
message -- it'll do internal fragmenting, but it has its own
communications co-processor, so it'll keep making progress, even during
the sleep(1)).
Make sense?
On May 12, 2005, at 12:27 PM, Stephan Mertens wrote:
> Jeff:
>
> Thanks for running my program (the bug was a transcription error).
> You are right, MPI_Bsend does trigger progress sometimes,
> but apparently not always. A new version of my program (attached)
> allocates enough buffer to prevent overflow in any case and prints out
> a mark if the sender calls MPI_Buffer_detach.
>
> Here is an extreme example (with sleep reduced to 1 second)
>
> leonardo:~/projects/mpi/src$ mpirun n0,1 -ssi rpi tcp Bsend -b 1000000
> Buffer size is 1000000
> sending message 1 at Thu May 12 17:49:29 2005
> sending message 2 at Thu May 12 17:49:30 2005
> sending message 3 at Thu May 12 17:49:31 2005
> sending message 4 at Thu May 12 17:49:32 2005
> sending message 5 at Thu May 12 17:49:33 2005
> sending message 6 at Thu May 12 17:49:34 2005
> sending message 7 at Thu May 12 17:49:35 2005
> sending message 8 at Thu May 12 17:49:36 2005
> sending message 9 at Thu May 12 17:49:37 2005
> sending message 10 at Thu May 12 17:49:38 2005
> sender detaches buffer at Thu May 12 17:49:42 2005
> ****** message 1 received at Thu May 12 17:49:42 2005
> ****** message 2 received at Thu May 12 17:49:42 2005
> ****** message 3 received at Thu May 12 17:49:42 2005
> ****** message 4 received at Thu May 12 17:49:42 2005
> ****** message 5 received at Thu May 12 17:49:42 2005
> ****** message 6 received at Thu May 12 17:49:42 2005
> ****** message 7 received at Thu May 12 17:49:42 2005
> ****** message 8 received at Thu May 12 17:49:42 2005
> ****** message 9 received at Thu May 12 17:49:42 2005
> ****** message 10 received at Thu May 12 17:49:42 2005
>
> As you can see, all messages are delivered not before the sender
> forces them out by detaching the buffer. For other message sizes I
> observe scenarios similar to yours.
>
> We are using LAM 7.1.1 on a 2.6.10 SMP kernel (see laminfo below),
> the two nodes above are linked by GBit ethernet.
>
> Of course we don't use Bsend in any serious application. This is just
> a pedagogical study for a book on "cluster computing"
> that I am coauthoring :-)
>
> Cheers,
> Stephan
>
> leonardo:~/projects/mpi/src$ laminfo
> LAM/MPI: 7.1.1
> Prefix: /usr
> Architecture: i686-pc-linux-gnu
> Configured by: root
> Configured on: Wed Apr 13 17:29:57 CEST 2005
> Configure host: hal
> Memory manager: ptmalloc2
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C compiler: gcc
> C++ compiler: g++
> Fortran compiler: g77
> Fortran symbols: double_underscore
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> C++ exceptions: no
> Thread support: yes
> ROMIO support: yes
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI boot: globus (API v1.1, Module v0.6)
> SSI boot: rsh (API v1.1, Module v1.1)
> SSI boot: slurm (API v1.1, Module v1.0)
> SSI coll: lam_basic (API v1.1, Module v7.1)
> SSI coll: shmem (API v1.1, Module v1.0)
> SSI coll: smp (API v1.1, Module v1.2)
> SSI rpi: crtcp (API v1.1, Module v1.1)
> SSI rpi: lamd (API v1.0, Module v7.1)
> SSI rpi: sysv (API v1.0, Module v7.1)
> SSI rpi: tcp (API v1.0, Module v7.1)
> SSI rpi: usysv (API v1.0, Module v7.1)
> SSI cr: self (API v1.0, Module v1.0)
> --
> Stephan Mertens @ http://www.uni-magdeburg.de/mertens
> Supercomputing in Magdeburg @ http://tina.nat.uni-magdeburg.de
> <Bsend.c>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/