On Tue, 5 Aug 2003, etienne gondet wrote:
> I am trying on a IBM SP4 at ECMWF to spawn an executable from one
> node to another one. It works with lam 6.5.9 but not with 7.0 .
> Furthermore, I tested it on a cluster of PCs, where it works with LAM 7.0. The
> strange point is that "lamboot -v boot_schema" works, but it just deadlocks at
> spawn time during the "mpirun -c 1 ./driver" command.
I'm a little confused here -- you say that lamboot hangs during mpirun?
lamboot and mpirun are effectively unrelated. Do you mean that mpirun
hangs during "mpirun -c 1 ./driver"?
> The idea is to make the driver spawn a block executable on the node
> indicated in block.where, and if this file is not there, then the
> spawned block stays on the same node as the driver (what we call
> intranode, and which is working).
The problem is twofold:
- in the case where you do not use an app schema, you have the maxprocs
argument of MPI_Comm_spawn set to 1, meaning that only one "block"
executable is launched.
- when only 1 block executable is launched, it tries to ping-pong with
itself (i.e., send to 0 and then receive from 0). Since all of your MPI
point-to-point calls are blocking, this can cause deadlock depending on
how big the messages are and how much the MPI implementation buffers
messages. LAM's tcp RPI gives 64k of buffering by default. If you exceed
this, LAM will use long message protocols, and wait for MPI_Send's to be
ACK'ed by the corresponding MPI_Recv before returning (which will never
happen in this case). Hence, deadlock.
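The deadlock pattern can be sketched in a standalone program (a minimal illustration, not your block.c; the buffer size is just chosen to exceed the 64k threshold):

```c
/* Blocking self ping-pong: deadlocks once the message exceeds the
 * implementation's internal buffering (64k for LAM's tcp RPI). */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    /* Big enough to exceed a 64k short-message buffer. */
    const int count = 1024 * 1024;
    double *buf = malloc(count * sizeof(double));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* With only 1 process, the rank sends to itself.  For a message
     * this large, MPI_Send uses the long-message protocol and blocks
     * until the matching receive is posted -- but control never
     * reaches MPI_Recv.  Deadlock. */
    MPI_Send(buf, count, MPI_DOUBLE, rank, 0, MPI_COMM_WORLD);
    MPI_Recv(buf, count, MPI_DOUBLE, rank, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    free(buf);
    MPI_Finalize();
    return 0;
}
```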
The quick fix is to change the maxprocs arguments of the two
non-app-schema MPI_Comm_spawn calls. If you really do want to be able to
check ping-pong latency/bandwidth of send-to-self, then you'll need to
change block.c to use non-blocking MPI communication.
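The non-blocking version of the same exchange completes for any message size, because the receive is already posted when the send needs it (again a sketch with illustrative sizes, not your actual block.c):

```c
/* Non-blocking self ping-pong: post both the receive and the send,
 * then wait on both.  No deadlock, regardless of message size or
 * how much the MPI implementation buffers. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int count = 1024 * 1024;   /* > 64k: would deadlock if blocking */
    double *sendbuf = malloc(count * sizeof(double));
    double *recvbuf = malloc(count * sizeof(double));
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Irecv(recvbuf, count, MPI_DOUBLE, rank, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, rank, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```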
> There are two other little points. First, lamhalt starts but hangs, never finishing:
> 35:> lamhalt
>
> LAM 7.0/MPI 2 C++/ROMIO - Indiana University
lamhalt will hang for a while if any of the LAM daemons have already died.
> The last little point is that mpiexec is not installed.
> The ECMWF support tried tracking the matter and they told me there is
> something wrong with their perl version. When I run
> "tools/mpiexec/mpiexec -test" I get:
>
> Can't locate File/Temp.pm in @INC (@INC contains: /usr/opt/perl5/lib/5.6.0/aix
> /usr/opt/perl5/lib/5.6.0
This is because mpiexec requires the File::Temp perl module, which is
standard in perl 5.8. For the moment, if you wish to use mpiexec, you'll
need to upgrade your perl to 5.8. But I wouldn't worry about this -- all
the functionality that is available through mpiexec is also available
through mpirun, lamboot, lamhalt, etc.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/