Bogdan Costescu wrote:
>On Tue, 30 Aug 2005, Pierre Valiron wrote:
>
>
>
>>This behaviour is very annoying for scripting batch jobs.
>>
>>
>
>I have used a similar approach (lamboot immediately followed by mpirun,
>then lamhalt) in a wrapper script that ran under SGE, Torque
>(using their native start-up mechanisms) and with simple rsh/ssh, and I
>never encountered such a problem.
>
>From the error message, I understand that it is mpirun that somehow
>fails to start the job; the daemons should be properly started at
>that point, otherwise I think the error message would be
>different.
>
>Can you try using the -s option of mpirun? It makes mpirun not rely
>on NFS (or whatever shared FS you are using) to provide the program;
>instead, mpirun copies the executable itself from the first node. It
>is mentioned in the mpirun man page, and I have seen myself with NFS
>that if the program is freshly produced (as the result of a
>compile/link process), there might be errors trying to execute it
>immediately.
>Another way to prove this is to not execute your program directly but
>to wrap it in a shell script that does some 'echo' and then execs your
>program - if you get the echoes from all nodes, it means that mpirun
>did try to start the job on all nodes and it is the program itself
>that does not run properly enough to reach MPI_Init.
>
>
Well, I finally found that the problem was related to the behaviour of
MPI_Init. The code snippet below is buggy when started over many nodes
and processes:
      call MPI_Init(err)
      call MPI_Comm_rank(MPI_COMM_WORLD, me, err)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, err)
c     ... (some work) ...
      call MPI_Finalize(err)
      end
If I include
      call MPI_Barrier(MPI_COMM_WORLD, err)
right after MPI_Init, all problems disappear.
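For completeness, here is a minimal self-contained sketch of the
workaround (the program name, the mpif.h include and the integer
declarations are only filled in for clarity; the actual fix is just
the MPI_Barrier placed right after MPI_Init):

      program mpitest
      include 'mpif.h'
      integer err, me, nprocs

      call MPI_Init(err)
c     Workaround: synchronise all processes before any other MPI call
      call MPI_Barrier(MPI_COMM_WORLD, err)
      call MPI_Comm_rank(MPI_COMM_WORLD, me, err)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, err)
c     ... (some work) ...
      call MPI_Finalize(err)
      end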
I could not tell exactly what has been cured by the MPI_Barrier call.
Whether it fixes a wrong MPI_Comm_rank or MPI_Comm_size, or a not fully
functional MPI environment, is hard to say, as one process dies before
writing anything...
Using mpirun -s reduces the occurrence of the bug, but does not provide a
cure. For some unknown reason, adding a sleep after lamboot also helps.
Very strange.
Pierre.
>
>
>>We are very proud of our fast OAR batch system, which starts a 100
>>proc job in a second, and we don't want to introduce unneeded
>>delays.
>>
>>
>
>I had never heard of OAR, so thanks for mentioning it!
>
>
>
--
Support the SAUVONS LA RECHERCHE movement:
http://recherche-en-danger.apinc.org/
_/_/_/_/ _/ _/ Dr. Pierre VALIRON
_/ _/ _/ _/ Laboratoire d'Astrophysique
_/ _/ _/ _/ Observatoire de Grenoble / UJF
_/_/_/_/ _/ _/ BP 53 F-38041 Grenoble Cedex 9 (France)
_/ _/ _/ http://www-laog.obs.ujf-grenoble.fr/~valiron/
_/ _/ _/ Mail: Pierre.Valiron_at_[hidden]
_/ _/ _/ Phone: +33 4 7651 4787 Fax: +33 4 7644 8821
_/ _/_/