On May 12, 2005, at 8:17 AM, Pierre Valiron wrote:
> I am worried about possible memory requirements for collective
> operations, namely MPI_Bcast and MPI_Reduce, when very large
> buffers are broadcasted or reduced.
LAM's implementation of MPI_Bcast requires only enough additional
memory to create an MPI_Request for each peer a given rank
communicates with (approximately log(number of ranks in the
communicator)). The communication itself is done entirely in the
user-supplied buffers. For MPI_Reduce, we use an internal buffer
roughly 2x the size of the message being reduced.
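As a rough rule of thumb (the 2x figure is approximate, not a
documented guarantee), you can estimate the extra space a reduction
will need from the element count and the datatype size; the helper
name below is just for illustration:

    #include <mpi.h>
    #include <stddef.h>

    /* Rough upper bound on the scratch space an MPI_Reduce of `count`
     * elements of `dtype` might need, using the ~2x figure above.
     * Call after MPI_Init. */
    static size_t approx_reduce_scratch(int count, MPI_Datatype dtype)
    {
        int type_size;
        MPI_Type_size(dtype, &type_size);   /* bytes per element */
        return 2 * (size_t)count * (size_t)type_size;
    }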
> In my application the buffer size is close to the available core
> per processor.
> I have found no problems so far using LAM/MPI over Ethernet.
> However I have run into big trouble with some vendors' MPI (in
> particular on IBM supercomputers) when using the dedicated Colony
> or Federation switches. I have also experienced some weird problems
> with LAM/MPI and Myrinet on Itanium systems, but it was harder to
> pinpoint the trouble.
>
> A simple cure is to chop the collective operation into smaller
> chunks. But there is no obvious choice for the chunk size... In
> addition, I guess it would be better to let the MPI implementation
> perform its own collective optimizations and do the best job with
> the available core memory and the latency and throughput of the
> available interconnect.
For MPI_Bcast, you should be fine with LAM regardless of the send
size. With some of the high-speed interconnects (Myrinet /
InfiniBand) there may be issues with the maximum message size, but it
should be very obvious when LAM hits those limits. With MPI_Reduce,
if you are reducing very large arrays, the extra memory usage may
push the nodes into swap. You might want to watch the systems to
make sure that doesn't happen...
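If you do end up chopping the reduction yourself, something along
these lines keeps the per-call message (and therefore the library's
temporary buffer) bounded. This is only a sketch: CHUNK_ELEMS,
MPI_DOUBLE and MPI_SUM are placeholders to adapt to your application,
and every rank must use the same count so the chunks stay in step:

    #include <mpi.h>

    #define CHUNK_ELEMS (1 << 20)  /* ~8 MB of doubles per call; tune to your core memory */

    /* Reduce a large array in fixed-size pieces so the library only
     * ever buffers CHUNK_ELEMS elements at a time.  In this sketch
     * recvbuf must be a valid count-element buffer on every rank. */
    static void chunked_reduce(double *sendbuf, double *recvbuf,
                               long count, int root, MPI_Comm comm)
    {
        long off;
        for (off = 0; off < count; off += CHUNK_ELEMS) {
            int n = (count - off < CHUNK_ELEMS) ? (int)(count - off)
                                                : CHUNK_ELEMS;
            MPI_Reduce(sendbuf + off, recvbuf + off, n,
                       MPI_DOUBLE, MPI_SUM, root, comm);
        }
    }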
> Do you specifically take care of such issues within LAM/MPI 7.1.1
> with the various supported interconnects? What do you plan with
> OpenMPI?
In LAM, we do very little that is interconnect-specific in our
collectives, and even less to deal with large messages. We try to
keep our internal memory usage to a minimum, but we don't do
fragmenting or anything like that to reduce memory usage. It's a
similar story right now with Open MPI, although we obviously plan to
add more interconnect-specific collective support as time goes on.
Hope this helps,
Brian