
LAM/MPI General User's Mailing List Archives


From: Bogdan Costescu (Bogdan.Costescu_at_[hidden])
Date: 2005-08-30 10:34:33


On Tue, 30 Aug 2005, Pierre Valiron wrote:

> This behaviour is very annoying for scripting batch jobs.

I have used a similar approach (lamboot immediately followed by mpirun
then lamhalt) in a wrapper script that executed under SGE, Torque
(using their native start-up mechanisms) and with simple rsh/ssh, and
I never encountered such a problem.
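The sequence I mean could be sketched roughly like this (the hostfile
name, program name and the run_lam_job function are placeholders of my
own, not anything from your setup):

```shell
#!/bin/sh
# Sketch of a batch wrapper: boot LAM, run the job, halt LAM.
run_lam_job() {
    lamboot -v "$1" || return 1   # boot the LAM daemons on the nodes in the hostfile
    mpirun C "$2"                 # run the program on all booted CPUs
    status=$?
    lamhalt                       # shut the daemons down again
    return $status
}

# Only attempt the run where LAM/MPI is actually installed.
if command -v lamboot >/dev/null 2>&1; then
    run_lam_job ./hostfile ./my_mpi_program
fi
```

Under SGE or Torque the hostfile would of course come from the
scheduler rather than being hard-coded.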

From the error message, I understand that it's mpirun that somehow
fails to start the job; the daemons should be properly started by
that point, otherwise I think the error message would be different.

Can you try using the -s option of mpirun? This makes mpirun copy the
program from the first node itself, instead of relying on NFS (or
whatever shared file system you are using) to provide it. It is
mentioned in the mpirun man page, and I have experienced myself with
NFS that if the program is freshly produced (as the result of a
compile/link process), there might be errors when trying to execute
it immediately. Another way to test this is to not execute your
program directly, but to wrap it in a shell script that does some
'echo' and then execs your program - if you get the echoes from all
nodes, it means that mpirun did try to start the job on all nodes and
it's the program itself that fails before reaching MPI_Init.
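Such a debug wrapper could look like this (the script and program
names are just examples of mine):

```shell
#!/bin/sh
# debug_wrapper.sh - print a sign of life from each node, then hand
# over to the real MPI program.
echo "wrapper started on $(hostname) with args: $*"
# exec replaces this shell with the real program, so the wrapper
# leaves no extra process behind.
exec "$@"
```

You would then run something like "mpirun C ./debug_wrapper.sh
./my_mpi_program" (or with -s to copy from the first node, as above)
and count the echoes.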

> We are very proud of our fast OAR batch system, which starts a 100
> proc job in a second, and we don't want to introduce unneeded
> delays.

I had never heard of OAR, so thanks for mentioning it!

-- 
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_[hidden]