This is not really a solution but a workaround: if lamhalt; lamboot is all
that is needed to make the environment work again, then you can either
call lamboot before each mpirun and lamhalt after it, or do a more
'sophisticated' version: check the mpirun return code and, if it is 215,
call lamhalt; lamboot. In that case you lose one run (the failed one), but
the overall overhead is significantly smaller. Of course, you can
automatically re-run the failed task as well.
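A minimal sketch of that wrapper in POSIX shell (the function name and the
retry-once policy are my own; adjust -np and the task paths to your setup):

```shell
#!/bin/sh
# run_task: run one MPI task; if mpirun exits with 215, reset the LAM
# environment with lamhalt; lamboot and re-run the failed task once.
run_task() {
    mpirun -np 4 "$@"
    rc=$?
    if [ "$rc" -eq 215 ]; then
        lamhalt
        lamboot
        mpirun -np 4 "$@"    # re-run the task that failed
        rc=$?
    fi
    return "$rc"
}

# example usage:
# run_task ../AlgoTools/clustalw-mpi-0.13/clustalw-mpi
```

Calling run_task for each task in the series should let it continue past a
single 215 failure without manual intervention.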
If you're running a long series of relatively short tasks, this approach
will at least ensure that the series finishes without human intervention
and no time is lost.
thanks,
Konrad
On Wed, 17 Sep 2008, Philippe GOURET wrote:
>
> Hello
>
> We use LAM / MPI just to run some tools (with their mpi version) in a
> bioinformatics pipeline on a multi-processor computer.
>
> mpirun -np 4 ../AlgoTools/tree-puzzle-5.1/src/ppuzzle
>
> and
>
> mpirun -np 4 ../AlgoTools/clustalw-mpi-0.13/clustalw-mpi
>
>
> Normally we have no trouble, but sometimes (and for our lab it's a big
> problem because we lose hours of computation) we get error 215 at
> mpirun exit. From that point on, any new call to mpirun returns error
> 215. We have to do a new lamboot, and then it works again.
>
> (note that calls on the same data sometimes fail and sometimes are successful)
>
>
> Can you help us?
>
> Thanks in advance
>
> Best Regards
>
> Philippe Gouret
> Evolutionary Biology and Modeling
> University of Provence - Marseille - France
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>