LAM/MPI General User's Mailing List Archives

From: Pierre Valiron (Pierre.Valiron_at_[hidden])
Date: 2005-08-30 16:09:16


Bogdan Costescu wrote:

>On Tue, 30 Aug 2005, Pierre Valiron wrote:
>
>
>
>>This behaviour is very annoying for scripting batch jobs.
>>
>>
>
>I have used a similar approach (lamboot immediately followed by mpirun
>then lamhalt) in a wrapper script that executed under SGE, Torque
>(using their native start-up mechanisms) and with simple rsh/ssh and
>never encountered such a problem.
>
>From the error message, I understand that it's mpirun that fails
>somehow to start the job; the daemons should be properly started at
>that point, otherwise I think that the error message would be
>different.
>
>Can you try using the -s option of mpirun? This makes mpirun not
>rely on NFS (or whatever shared FS you are using) to provide the
>program, but copies it itself from the first node. It is mentioned in
>the mpirun man page and I have experienced it myself with NFS that if
>the program is freshly produced (as a result of a compile/link
>process), there might be errors trying to execute it immediately.
>Another way to prove this is to not execute your program directly but
>wrap it with a shell script that does some 'echo' then execs your
>program - if you get the echos from all nodes, it means that mpirun
>did try to start the job on all nodes and it's the program itself that
>doesn't run properly to reach MPI_Init.
>
>
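The wrapper test suggested in the quoted text above could be sketched
roughly as follows. This is only an illustration: the function name
make_echo_wrapper and the message format are assumptions, not anything
from the thread or from LAM itself. The generated script prints one line
to stderr and then execs the real program, so the program replaces the
wrapper process.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): writes a small wrapper
# script to the path given as $1.
make_echo_wrapper() {
    # The quoted 'EOF' keeps $1, $@ and $(...) unexpanded here, so they
    # are evaluated later, when the wrapper itself runs on a node.
    cat > "$1" <<'EOF'
#!/bin/sh
# Report that the launcher reached this node, then become the program.
echo "wrapper: starting $1 on $(uname -n)" >&2
exec "$@"
EOF
    chmod +x "$1"
}
```

The wrapper would then be given to mpirun in place of the program, with
the real program as its argument; how arguments are forwarded should be
checked in the mpirun man page. If the echo line appears for every node,
mpirun did try to start the job everywhere.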
Dear Bogdan, NFS does not seem to be the cause of the problem. The
executable has been sitting there for a while. I also tried mpirun with
the -s option as you suggested, and the problem either remained identical
or showed up differently (probably during the execution of the -s step):

valiron_at_n11 ~ > lamboot $OAR_FILE_NODES ; mpirun -s `hostname` C rotate ; lamhalt

LAM 7.1.1/ROMIO - Indiana University

-----------------------------------------------------------------------------
The lamboot agent failed to open a client socket to the newly-booted
process at IP address 192.168.11.11, port 33760.
more blah...

Alternatively, I submitted the same task to all the CPUs in 30 successive
batch jobs, after inserting a 'sleep 30' after the lamboot and keeping
the plain command 'mpirun C rotate'. All 30 batch jobs then ran fine.

The key point seems to be the delay added after lamboot. Strange...
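
If a fixed delay turns out to be the workaround, a bounded polling loop
might keep the wait shorter than a flat 'sleep 30'. The sketch below is
an assumption-laden illustration: wait_until is a hypothetical helper,
and using lamnodes as the readiness probe is a guess. The thread does not
establish that a successful lamnodes means mpirun will succeed.

```shell
#!/bin/sh
# Hypothetical helper: wait_until LIMIT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds. Returns 0 on the
# first success, or 1 once LIMIT_SECONDS one-second waits have passed.
wait_until() {
    limit=$1
    shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Possible use in a batch script (the lamnodes probe is an assumption):
#   lamboot $OAR_FILE_NODES
#   wait_until 30 lamnodes && mpirun C rotate
#   lamhalt
```

The loop costs nothing when the probe succeeds at once, and at worst
roughly as long as the fixed sleep it replaces.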

>
>
>>We are very proud of our fast OAR batch system, which starts a 100
>>proc job in a second, and we don't want to introduce unneeded
>>delays.
>>
>>
>
>I never heard of OAR, so thanks for mentioning it !
>
>
Have a look at http://oar.imag.fr/ and http://ka-tools.sourceforge.net/
for the Monika status page on the web.
Unfortunately, the documentation is the weak point. However, the system
is small, modular and easy to understand, maintain or extend. We also use
it as a tool for our local grid engines.

Pierre.

-- 
Support the SAUVONS LA RECHERCHE movement:
http://recherche-en-danger.apinc.org/
       _/_/_/_/    _/       _/       Dr. Pierre VALIRON
      _/     _/   _/      _/   Laboratoire d'Astrophysique
     _/     _/   _/     _/    Observatoire de Grenoble / UJF
    _/_/_/_/    _/    _/    BP 53  F-38041 Grenoble Cedex 9 (France)
   _/          _/   _/    http://www-laog.obs.ujf-grenoble.fr/~valiron/
  _/          _/  _/     Mail: Pierre.Valiron_at_[hidden]
 _/          _/ _/      Phone: +33 4 7651 4787  Fax: +33 4 7644 8821
_/          _/_/