On Aug 12, 2005, at 11:11 AM, Pierre Valiron wrote:
> The alltoall implementation you describe seems a bit hairy... If I
> understand it properly, it is in no way scalable. The total number of
> data packets to be exchanged grows as Nproc^2, sure, but in this naive
> implementation there is no linear bound on the number of pending
> packets. If you have 100 processors or more, it seems very risky to
> scatter 10,000 (large) packets or more in one shot across the system.
Correct.
> As I understand it, setting gigabit hardware flow control may help the
> OS to limit the packet explosion. This would explain why the MPI
> machine
> is less prone to freeze in this case.
Sounds reasonable.
> What is the status of the other collective operations in lam/mpi ? Is
> this "explosive" behaviour unique to alltoall ?
For the most part, yes. Most other algorithms use a logarithmic
approach. But alltoall is quite difficult to optimize (there is other
work in this area, but it has not been ported to LAM/MPI).
> I could easily imagine how to write another naive alltoall on top of
> Isend and Irecv which would limit the number of pending requests to a
> few per node with no serious performance penalty. Has this kind of
> "safe" algorithm already been written by MPI gurus? Is it planned for
> Open MPI? If not, I am willing to write a demonstration code.
I should mention that both LAM and Open MPI use a component
architecture for their collective algorithm implementations. As such,
it's quite easy to drop in a new algorithm.
We have not had the time to write out a better alltoall [yet]; if you
wanted to do a little work in this area, that would be fantastic.
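Just to make sure we're talking about the same thing, here's roughly
what I'd expect such a "windowed" alltoall to look like -- this is an
untested sketch, not anything from the LAM source tree, and the
function name, tag, and WINDOW value are arbitrary (error handling and
MPI_IN_PLACE are omitted):

#include <mpi.h>

#define WINDOW 8            /* max outstanding send/recv pairs */

int windowed_alltoall(void *sendbuf, int scount, MPI_Datatype stype,
                      void *recvbuf, int rcount, MPI_Datatype rtype,
                      MPI_Comm comm)
{
    int rank, size, step, err, inflight = 0;
    MPI_Aint sext, rext;
    MPI_Request reqs[2 * WINDOW];

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Type_extent(stype, &sext);
    MPI_Type_extent(rtype, &rext);

    /* Pairwise exchange: at step i, send to (rank + i) and receive
       from (rank - i).  Every rank walks the steps in the same order,
       so each receive in a window is matched by a send posted in the
       same window on the peer. */
    for (step = 0; step < size; ++step) {
        int sendto   = (rank + step) % size;
        int recvfrom = (rank - step + size) % size;

        if (inflight == WINDOW) {
            /* Window is full -- drain it before posting more. */
            err = MPI_Waitall(2 * WINDOW, reqs, MPI_STATUSES_IGNORE);
            if (err != MPI_SUCCESS) return err;
            inflight = 0;
        }
        MPI_Irecv((char *) recvbuf + (MPI_Aint) recvfrom * rcount * rext,
                  rcount, rtype, recvfrom, 0, comm, &reqs[2 * inflight]);
        MPI_Isend((char *) sendbuf + (MPI_Aint) sendto * scount * sext,
                  scount, stype, sendto, 0, comm, &reqs[2 * inflight + 1]);
        ++inflight;
    }
    /* Drain whatever is still outstanding. */
    return MPI_Waitall(2 * inflight, reqs, MPI_STATUSES_IGNORE);
}

The pairwise ordering (send to rank+step, receive from rank-step) keeps
each window self-matching across all ranks, so no process ever has more
than 2*WINDOW requests pending, at the cost of a little synchronization
at each window boundary.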
Probably the easiest thing to do would be to simply replace the current
alltoall algorithm -- it's in
share/ssi/coll/lam_basic/src/ssi_coll_basic_lam_basic_alltoall.c
(ignore the *_lamd() function). The code is relatively simple -- it
just creates a bunch of persistent requests, issues MPI_Startall(), and
then MPI_Waitall(). The rationale for using persistent requests was to
create all the setup first and *then* initiate all the communication
(i.e., don't mix the setup with potential communication delays). This
could be a moot point, however, especially for a better algorithm.
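The overall shape is roughly the following (a simplified from-memory
sketch, not the actual lam_basic source; argument checking, the real
collective tag, and error paths are omitted):

#include <mpi.h>
#include <stdlib.h>

int basic_alltoall_shape(void *sendbuf, int scount, MPI_Datatype stype,
                         void *recvbuf, int rcount, MPI_Datatype rtype,
                         MPI_Comm comm)
{
    int size, i, err;
    MPI_Aint sext, rext;
    MPI_Request *reqs;

    MPI_Comm_size(comm, &size);
    MPI_Type_extent(stype, &sext);
    MPI_Type_extent(rtype, &rext);
    reqs = (MPI_Request *) malloc(2 * size * sizeof(MPI_Request));

    /* Setup phase: build all the persistent requests first; no
       communication is initiated here. */
    for (i = 0; i < size; ++i) {
        MPI_Recv_init((char *) recvbuf + (MPI_Aint) i * rcount * rext,
                      rcount, rtype, i, 0, comm, &reqs[2 * i]);
        MPI_Send_init((char *) sendbuf + (MPI_Aint) i * scount * sext,
                      scount, stype, i, 0, comm, &reqs[2 * i + 1]);
    }

    /* Then initiate all 2*size transfers at once and wait for them. */
    MPI_Startall(2 * size, reqs);
    err = MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);

    for (i = 0; i < 2 * size; ++i)
        MPI_Request_free(&reqs[i]);
    free(reqs);
    return err;
}

That MPI_Startall() is exactly where the packet explosion you observed
comes from: every process has 2*Nproc requests in flight at once, so
the system as a whole sees on the order of Nproc^2 concurrent transfers.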
Any work that you do here will also be pretty much directly applicable
to Open MPI (its first generation collective component framework is
almost identical to LAM's).
If you configure LAM/MPI with --enable-shared --disable-static, then
you can easily re-build/re-install the lam_basic coll component
directly from the share/ssi/coll/lam_basic directory (i.e., a simple
"make all install" in there) rather than having to re-link the entire
libmpi MPI library itself. This tends to save a lot of time during
debugging (you should probably "make uninstall" the static installation
first, however, to prevent confusion between the static and dynamic
installs of LAM/MPI).
Let us know what you find.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/