Hi,
I have found a strange bug with lamhalt when running many calculations
in batch mode under Solaris 10 (Opteron) with LAM/MPI 7.1.1. I suspect
the bug is more general.
I need to keep distinct LAM universes because several jobs may run on
the machine at the same time.
I have the following sequence in my batch jobs:
TMPDIR="/some/unique/path"
export TMPDIR
mkdir -p $TMPDIR
lamboot
cd $TMPDIR
mpirun some_work
cp results $HOME
lamhalt
rm -rf $TMPDIR
The bug is that the LAM daemon is not properly halted. After running
for a while, I discovered *thousands* of leftover LAM daemons on the
machine...
I suppose lamhalt posts some asynchronous request to the daemon, and if
TMPDIR is deleted too quickly the daemon is prevented from halting. If I
add a sleep between lamhalt and rm, the daemon is generally halted
properly.
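A fixed sleep is only a guess at how long the daemon needs. A rough
sketch of a less arbitrary wait is given below: it polls until a given
process has exited, or gives up after a timeout. The helper name
wait_gone and its arguments are hypothetical (nothing LAM provides),
and it assumes a POSIX shell (ksh or bash) and that the PID of this
job's lamd is already known by some means.

```sh
# Hypothetical helper, not part of LAM.
# Poll until process $1 has exited, or give up after $2 seconds
# (default 30). Returns 0 if the process is gone, 1 on timeout.
# kill -0 sends no signal; it only tests whether the PID can be signalled.
wait_gone() {
    pid=$1
    timeout=${2:-30}
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

In the batch job this would sit between lamhalt and rm, for instance
as: wait_gone "$lamd_pid" 60 && rm -rf $TMPDIR, where lamd_pid is an
assumed variable holding the daemon's PID. TMPDIR is then removed only
once the daemon is really gone.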
Of course there is a workaround using
export LAM_MPI_SESSION_PREFIX="/some/permanent/path"
export LAM_MPI_SESSION_SUFFIX="some_unique_name"
However, it is elegant to use the unique TMPDIR to trigger a unique LAM
universe... and this should work; if it can't be fixed for some
reason, the doc should be updated accordingly.
All the best for LAM and Open-MPI.
Pierre.
--
Support the SAUVONS LA RECHERCHE movement:
http://recherche-en-danger.apinc.org/
_/_/_/_/ _/ _/ Dr. Pierre VALIRON
_/ _/ _/ _/ Laboratoire d'Astrophysique
_/ _/ _/ _/ Observatoire de Grenoble / UJF
_/_/_/_/ _/ _/ BP 53 F-38041 Grenoble Cedex 9 (France)
_/ _/ _/ http://www-laog.obs.ujf-grenoble.fr
_/ _/ _/ mail: Pierre.Valiron_at_[hidden]
_/ _/ _/ Phone: +33 4 7651 4787 Fax: +33 4 7644 8821
_/ _/_/