On Jun 5, 2006, at 2:30 PM, McCalla, Mac wrote:
> I have been trying to figure this problem out for a couple of days
> now and have given myself a headache, so I would appreciate some
> guidance.
> <<lam712-doc.tar.gz>>
> LAM release is 7.1.2. Systems are RHEL WS R3.
>
> Both the original application and the test program have run
> successfully many times on different node sets.
>
> The problem scenario is 360 processes on 294 nodes. The test
> application is a slight variant of the lam/examples/hello/hello.c
> program. (As you might guess, this was not the original
> application.) lamboot is successful.
>
> The mpirun specifying "-ssi rpi lamd" fails - issuing:
>
> No output from the application itself appears; does that mean
> MPI_Init never completes for rank 0? Output captured from the run
> is in the mpirun_lamd.log.t1b file.
>
> The mpirun specifying "-ssi rpi tcp" succeeds without an
> intervening lamboot.
>
> Is it possible that the issue is overloading of the lamd on the
> master node somehow?...
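
(The test program above is described only as a slight variant of
lam/examples/hello/hello.c, so its exact code is not part of the report.
For readers following along, a minimal MPI program of that shape looks
like the sketch below; the reporter's actual variant may differ.)

/* Sketch of the general shape of lam/examples/hello/hello.c; the
 * reporter's actual variant is not included in the message above. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);   /* the call the reporter suspects never completes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello, world!  I am %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
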
It looks like we're failing internally in the lamd rpi when using
gatherv at this scale. That isn't totally unexpected -- the lamd rpi
is essentially unsupported at this point, and it does not work well
beyond a very small number of nodes. I'd recommend using TCP instead
of lamd: your performance will be much better, and you'll avoid the
scaling failures that can occur in the lamd rpi.
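
(The gatherv referred to above is internal to the lamd rpi, not
necessarily anything in the test program. For anyone unfamiliar with the
operation being named, a user-level MPI_Gatherv call looks roughly like
the sketch below; the per-rank counts are made up purely for
illustration.)

/* Sketch only: a variable-count gather to rank 0.  Each rank sends
 * (rank + 1) integers; the root supplies per-rank counts and offsets. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendcount = rank + 1;
    int *sendbuf = malloc(sendcount * sizeof(int));
    for (i = 0; i < sendcount; ++i)
        sendbuf[i] = rank;

    int *recvcounts = NULL, *displs = NULL, *recvbuf = NULL;
    if (rank == 0) {
        recvcounts = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
        int total = 0;
        for (i = 0; i < size; ++i) {
            recvcounts[i] = i + 1;   /* how many ints rank i contributes */
            displs[i] = total;       /* where its data lands in recvbuf  */
            total += recvcounts[i];
        }
        recvbuf = malloc(total * sizeof(int));
    }

    MPI_Gatherv(sendbuf, sendcount, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("root gathered contributions from %d ranks\n", size);

    MPI_Finalize();
    return 0;
}
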
If you are using the lamd rpi to try to get overlap of computation
and communication, you might want to look at Open MPI. Your
performance will be much better, and we support better overlap with
the TCP and shared memory transports when progress threads are
enabled.
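
("Overlap of computation and communication" here means the usual
nonblocking pattern: start the transfer, do local work, then complete
the request. A minimal sketch in plain MPI -- nothing LAM- or Open
MPI-specific; the ring exchange and buffer size are made up for
illustration:)

/* Sketch: overlap computation with communication using nonblocking
 * point-to-point calls.  Each rank exchanges a buffer with its ring
 * neighbors while doing unrelated local work. */
#include <stdio.h>
#include <mpi.h>

#define N 1024

int main(int argc, char *argv[])
{
    int rank, size, i;
    double sendbuf[N], recvbuf[N], local = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    for (i = 0; i < N; ++i)
        sendbuf[i] = rank + i;

    /* Start the exchange... */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ...do local work while the transfer is (ideally) progressing... */
    for (i = 0; i < N; ++i)
        local += sendbuf[i] * 0.5;

    /* ...then complete the communication before touching recvbuf. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d: local result %f, first received value %f\n",
           rank, local, recvbuf[0]);

    MPI_Finalize();
    return 0;
}

(Whether the library actually makes progress on the transfer during the
compute loop depends on the implementation; that is exactly what the
progress-thread support mentioned above is for.)
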
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/