Thanks for the info, Brian (not to mention all the great work on LAM &
Open MPI!).
This was a new application being brought up, and we were using the same
mpirun parameters as a previous application. The tcp rpi is indeed
consistently an order of magnitude faster on this new app. We have been
watching Open MPI development and hope to move to it before the end of
the year.
Best Regards,
Mac McCalla
Geoscience Systems Consultant
Hess Corporation
500 Dallas St., Houston, Texas 77002
-----Original Message-----
From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf
Of Brian Barrett
Sent: Wednesday, June 07, 2006 10:46 PM
To: General LAM/MPI mailing list
Subject: Re: LAM: Request assistance with trouble shooting a failure
On Jun 5, 2006, at 2:30 PM, McCalla, Mac wrote:
> I have been trying to figure this problem out for a couple of days
> now and have given myself a headache, so
> would appreciate some guidance.
> <<lam712-doc.tar.gz>>
> LAM release is 7.1.2. Systems are RHEL WSR3.
>
> There are many successful examples of the original application as
> well as the test one, running on different node sets.
>
> The problem scenario is 360 processes on 294 nodes. The test
> application is a slight variant of the lam/examples/hello/hello.c
> program. (As you might guess, this was not the original
> application.) lamboot is successful.
>
> The mpirun specifying "-ssi rpi lamd" fails.
>
> No output from the application itself appears, suggesting to me
> that MPI_Init never completes for rank 0?
> Output captured from the run is in the mpirun_lamd.log.t1b file.
>
> The mpirun specifying "-ssi rpi tcp" succeeds without an
> intervening lamboot.
>
> Is it possible that the issue is overloading of the lamd on the
> master node somehow?...
It looks like we're failing internally in the lamd rpi when using
gatherv at scale. This isn't totally unexpected -- the lamd rpi is
really not supported anymore, and it definitely does not work well
beyond a small number of nodes. I'd recommend using the tcp rpi
instead of lamd -- your performance will be much better, and you'll
avoid the scaling failures that can occur in the lamd rpi.
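If you want a quick way to confirm this outside your real application,
a small standalone test along the lines below (the counts, names, and
the -np value are just placeholders, not taken from your code)
exercises MPI_Gatherv and can be launched under either rpi for
comparison:

/* gatherv_test.c -- a minimal MPI_Gatherv exerciser.
 * The counts and buffer sizes here are arbitrary; adjust as needed.
 * Run it under each rpi to compare, e.g.:
 *   mpirun -ssi rpi lamd -np 360 ./gatherv_test
 *   mpirun -ssi rpi tcp -np 360 ./gatherv_test
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, i, total, sendcount;
    int *sendbuf, *recvbuf = NULL, *recvcounts = NULL, *displs = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank contributes a slightly different number of ints */
    sendcount = (rank % 4) + 1;
    sendbuf = (int *) malloc(sendcount * sizeof(int));
    for (i = 0; i < sendcount; ++i)
        sendbuf[i] = rank;

    /* only the root needs the receive-side bookkeeping */
    if (rank == 0) {
        recvcounts = (int *) malloc(size * sizeof(int));
        displs = (int *) malloc(size * sizeof(int));
        total = 0;
        for (i = 0; i < size; ++i) {
            recvcounts[i] = (i % 4) + 1;
            displs[i] = total;
            total += recvcounts[i];
        }
        recvbuf = (int *) malloc(total * sizeof(int));
    }

    MPI_Gatherv(sendbuf, sendcount, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT,
                0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("gatherv across %d ranks completed\n", size);

    MPI_Finalize();
    return 0;
}

Under the tcp rpi I would expect this to run cleanly at your process
count; under lamd I would not be surprised if it hits the same failure
you are seeing.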
If you are using the lamd rpi to try to overlap computation and
communication, you might also want to look at Open MPI. Performance
there will be much better, and the TCP and shared memory transports
support better overlap when progress threads are enabled.
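Just to illustrate what I mean by overlap: the usual pattern is to post
nonblocking operations, do real work, and only then wait, roughly as in
the sketch below (N, peer, and compute() are placeholders, not anything
from your application). With progress threads enabled the library can
move the data while compute() runs; without them, most of the transfer
tends to happen inside MPI_Waitall.

/* overlap.c -- sketch of overlapping computation with communication
 * using nonblocking MPI calls. */
#include <mpi.h>

#define N 100000

static void compute(double *local, int n)
{
    int i;
    /* stand-in for real work done while the messages are in flight */
    for (i = 0; i < n; ++i)
        local[i] *= 2.0;
}

void exchange_and_compute(double *sendbuf, double *recvbuf,
                          double *local, int peer)
{
    MPI_Request reqs[2];

    /* post the transfers first ... */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... do unrelated work while they (ideally) progress ... */
    compute(local, N);

    /* ... and only block once the work is done */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}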
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/