Jeff Squyres wrote:
>On Tue, 5 Aug 2003, etienne gondet wrote:
>
>
>
First, thank you for your interest.
>> I am trying, on an IBM SP4 at ECMWF, to spawn an executable from one
>>node to another. It works with LAM 6.5.9 but not with 7.0.
>>Furthermore, I tested it on a cluster of PCs, where it works with LAM 7.0. The
>>strange point is that lamboot -v boot_schema works, but it deadlocks at
>>spawn time during the mpirun -c 1 ./driver command.
>>
>>
>
>I'm a little confused here -- you say that lamboot hangs during mpirun?
>lamboot and mpirun are effectively unrelated. Do you mean that mpirun
>hangs during "mpirun -c 1 ./driver"?
>
>
No, I never said that. I said that lamboot -v boot_schema with
several nodes in boot_schema works
fine, but then mpirun -c 1 ./driver hangs at the moment it tries to spawn the
executable called block on another
node, defined in "block.where" as n1, i.e. hpca2502, the second node of
the boot_schema file.
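
To be concrete, here is roughly the kind of spawn the driver does (a
simplified sketch, not the real driver.c; the node name, the file
parsing, and the use of the reserved "host" info key are illustrative
assumptions on my part):

/* Simplified sketch, not the real driver.c: spawn one "block" process
 * on the node named in block.where (assumed here to default to "n1").
 * Whether the spawn really goes to that node depends on the MPI
 * implementation honoring the reserved "host" info key. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm block_comm;
    MPI_Info info;
    char node[64] = "n1";   /* normally read from block.where */
    FILE *f;

    MPI_Init(&argc, &argv);

    /* If block.where exists, spawn on the node it names; otherwise
       the spawned block stays on the driver's node (intranode case). */
    f = fopen("block.where", "r");
    if (f != NULL) {
        if (fscanf(f, "%63s", node) != 1)
            strcpy(node, "n1");
        fclose(f);
    }

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", node);

    /* maxprocs = 1: only one block process is launched */
    MPI_Comm_spawn("block", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, &block_comm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&block_comm);
    MPI_Finalize();
    return 0;
}
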
>>The idea is to make the driver spawn a block executable on the node
>>indicated in block.where; if this file is not there, then the
>>spawned block stays on the same node as the driver (what we call
>>intranode, and which is working).
>>
>>
>The problem is twofold:
>
>- in the case where you do not use an app schema, you have the maxprocs
>argument of MPI_Comm_spawn set to 1, meaning that only one "block"
>executable is launched.
>
>
I don't understand what "twofold" means, but on that IBM SP4 the block
executable is never started on the other
node. I log in with rsh on that other node and never see a process called
block with ps -edf,
and the driver process is deadlocked. So the problem lies before the
ping-pong, in the process
management, not in the communication protocol and buffers. I understand
that, because I spawn a single block process, it should deadlock later in
the ping-pong after 64k; I had just reduced the number of processes as a
precaution.
With LAM 6.5.9, I can spawn on another node.
>- when only 1 block executable is launched, it tries to ping-pong with
>itself (i.e., send to 0 and then receive from 0). Since all of your MPI
>point-to-point calls are blocking, this can cause deadlock depending on
>how big the messages are and how much the MPI implementation buffers
>messages. LAM's tcp RPI gives 64k of buffering by default. If you exceed
>this, LAM will use long message protocols, and wait for MPI_Send's to be
>ACK'ed by the corresponding MPI_Recv before returning (which will never
>happen in this case). Hence, deadlock.
>
>The quick fix is to change the maxprocs arguments of the two
>non-app-schema MPI_Comm_spawn calls. If you really do want to be able to
>check ping-pong latency/bandwidth of send-to-self, then you'll need to
>change block.c to use non-blocking MPI communication.
>
>
>
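For what it's worth, here is a minimal sketch of a send-to-self exchange
rewritten with non-blocking calls (illustrative only, not the real
block.c; the communicator, message size, and tag below are made up):

/* Illustrative sketch only: a send-to-self exchange using non-blocking
 * calls, which does not depend on how much the MPI implementation
 * buffers messages. */
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i, n = 1 << 20;       /* well past 64k, made-up size */
    double *sbuf, *rbuf;
    MPI_Request req[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sbuf = malloc(n * sizeof(double));
    rbuf = malloc(n * sizeof(double));
    for (i = 0; i < n; i++)
        sbuf[i] = (double)i;        /* arbitrary payload */

    /* Post the receive and the send before waiting on either, so a
       send-to-self cannot deadlock even with long message protocols. */
    MPI_Irecv(rbuf, n, MPI_DOUBLE, rank, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sbuf, n, MPI_DOUBLE, rank, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
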
>>There are two other little points. First, lamhalt starts but hangs, never finishing.
>>35:> lamhalt
>>
>>LAM 7.0/MPI 2 C++/ROMIO - Indiana University
>>
>>
>>
>
>lamhalt will hang for a while if any of the LAM daemons have already died.
>
>
A very long while.
>>The last little point is that mpiexec is not installed.
>>The ECMWF support is tracking the matter, and they told me there is
>>something wrong with their perl
>>version. With
>>tools/mpiexec/mpiexec -test I get:
>>
>>Can't locate File/Temp.pm in @INC (@INC contains: /usr/opt/perl5/lib/5.6.0/aix
>>/usr/opt/perl5/lib/5.6.0
>>
>>
>
>This is because mpiexec requires the File::Temp perl module, which is
>standard in perl 5.8. For the moment, if you wish to use mpiexec, you'll
>need to upgrade your perl to 5.8. But I wouldn't worry about this -- all
>the functionality that is available through mpiexec is also available
>through mpirun, lamboot, lamhalt, etc.
>
>
OK.
>- Don't use hcc. Use mpicc. Since you're using an absolute pathname for
>the wrapper compiler, then use its preferred name (mpicc). "hcc" is an
>outdated name -- it's a sym link to mpicc anyway. "hcc" will likely
>disappear in some future release of LAM/MPI.
I used mpicc and got the same deadlock on the spawn.
>- You really don't need to specify any -I, -L, or -l flags for mpicc. I
>commented out LIBPATH, INCPATH, MPILIBS, and FLIBS when I compiled your
>program.
Yes, I know; it is there for compatibility with FUJITSU mpi2, which doesn't have wrapper compilers to handle the INC and LIB paths.
>- Not all the flags in make-include are propagated properly into the
>Makefile. For example, I changed -DSMALL to -DBIG in make-include and was
>surprised when it didn't propagate. I had to go manually change the
>Makefile to -DSMALL.
Yes, sorry, it's an old mistake which reappeared; some cvs tag
confusion, probably.