On Sun, 15 Jun 2003, Peter Skaarup wrote:
> At 65536/65537 there is a change in performance. A quick search through
> the source code (version 6.5.9) revealed that there is a line
> #define LAM_TCPSHORTMSGLEN 65536 and that this define is used in
> rpi/tcp/rpi_tcp.c to select the code for a short message
> '<= 65536' or a long message '> 65536'.
>
> My question is, why do you define this as the border between short
> and long messages? It seems that TCP fragments the MPI message anyway,
> into pieces that fit the MTU size, which is about 1500 bytes on most
> Ethernets. And with a message of 65536 bytes (a short one by
> definition), 65560 bytes will be passed to the TCP layer of the
> communication because of the 24-byte MPI header.

The change in performance is due to two different protocols being used in
MPI_Send: a "short" protocol and a "long" protocol. As you discovered,
the crossover point is 64K (65536 bytes), which is roughly the largest
buffer size you can count on being able to set on a TCP socket across
all the platforms on which LAM runs.
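
For illustration, asking the OS for a 64K socket buffer looks roughly like the sketch below. The helper name request_socket_buffers is made up for this example and is not taken from the LAM sources; only the standard POSIX setsockopt()/getsockopt() calls are real. The kernel is free to adjust the size it actually grants, which is why the sketch reads the value back.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from LAM): request send and receive buffers
 * of `bytes` on a socket and report the send buffer size the kernel
 * actually granted. Returns -1 if any call fails.
 */
int request_socket_buffers(int fd, int bytes)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;

    /* Read back the effective size; platforms may round or cap it. */
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &granted, &len) < 0)
        return -1;
    return granted;
}
```

Calling request_socket_buffers(fd, 65536) on a freshly created TCP socket and comparing the result with the request shows how a given platform treats a 64K buffer.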

The long protocol does not start sending the actual data until a matching
receive has been posted on the other side. While there are a bunch of
reasons for this behavior, one of the more compelling ones is the
buffering behavior of MPI. Without the long protocol, LAM could be put in
a situation where it has to receive a large message off the TCP connection
with no user buffer to put it in. The MPI implementation would then have
to malloc() a rather large amount of memory and later memcpy() the
message into the user buffer. This really isn't a great idea and can
kill performance. Hence, the long protocol.
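
To make the difference concrete, here is a minimal sketch of the decision, not the actual code in rpi/tcp/rpi_tcp.c. LAM_TCPSHORTMSGLEN and the 24-byte header come from this thread; every other name is hypothetical, and the idea that the long protocol initially puts only the header on the wire is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LAM_TCPSHORTMSGLEN 65536  /* value quoted from LAM 6.5.9 */
#define ENVELOPE_BYTES     24     /* MPI header size mentioned in the thread */

/* Hypothetical names below; they do not appear in the LAM sources. */
enum protocol { PROTO_SHORT, PROTO_LONG };

/* Short protocol up to and including the threshold, long above it. */
enum protocol choose_protocol(size_t payload_len)
{
    return payload_len <= LAM_TCPSHORTMSGLEN ? PROTO_SHORT : PROTO_LONG;
}

/*
 * Bytes the sender hands to TCP before hearing anything from the receiver.
 * Assumption: the long protocol sends only the header first and holds the
 * payload until a matching receive has been posted.
 */
size_t eager_bytes(size_t payload_len)
{
    assert(payload_len <= SIZE_MAX - ENVELOPE_BYTES);  /* no overflow */
    if (choose_protocol(payload_len) == PROTO_SHORT)
        return ENVELOPE_BYTES + payload_len;
    return ENVELOPE_BYTES;
}

/*
 * Worst-case temporary buffer the receiver must malloc() when the message
 * arrives before the matching receive is posted.
 */
size_t unexpected_buffer_bytes(size_t payload_len)
{
    return choose_protocol(payload_len) == PROTO_SHORT ? payload_len : 0;
}
```

Under these assumptions, a 65536-byte message puts 65560 bytes on the wire right away and may cost the receiver a 64K temporary buffer, while a 65537-byte message puts only the header on the wire up front and never needs a temporary buffer for the payload.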

Hope this helps,
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/