LAM/MPI General User's Mailing List Archives

From: Pierre Valiron (Pierre.Valiron_at_[hidden])
Date: 2005-06-05 12:18:02


Hi,

I have found a strange bug with lamhalt when running many calculations
in batch mode under Solaris 10 on Opteron with LAM/MPI 7.1.1. I suspect
the bug is more general.

I need to keep distinct LAM universes because several jobs may run on
the machine at the same time.
I have the following sequence in my batch jobs:

    TMPDIR="/some/unique/path"
    export TMPDIR
    mkdir -p $TMPDIR
    lamboot

    cd $TMPDIR
    mpirun some_work
    cp results $HOME

    lamhalt
    rm -rf $TMPDIR

The bug is that the LAM daemon is not properly halted. After running
for a while, I discovered *thousands* of leftover LAM daemons on the
machine...

I suppose lamhalt posts an asynchronous request to the daemon, and if
TMPDIR is deleted too quickly the daemon is prevented from halting. If I
add a sleep between lamhalt and rm, the daemon is generally halted
properly.
 

Of course, there is a workaround using:

    export LAM_MPI_SESSION_PREFIX="/some/permanent/path"
    export LAM_MPI_SESSION_SUFFIX="some_unique_name"
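
In a batch job, this workaround could be wrapped in a small helper. This
is only a sketch: the function name set_lam_session, the directory
$HOME/lam-sessions and the use of the shell PID ($$) as a unique name are
assumptions for illustration. A real job would more likely use the job
ID provided by the batch system.

```shell
#!/bin/sh
# set_lam_session PREFIX_DIR UNIQUE_NAME
# Point LAM at a permanent session directory and give this job's
# universe a unique suffix, independently of TMPDIR.
set_lam_session() {
    # Make sure the permanent directory exists before lamboot uses it.
    mkdir -p "$1" || return 1
    LAM_MPI_SESSION_PREFIX=$1
    LAM_MPI_SESSION_SUFFIX=$2
    export LAM_MPI_SESSION_PREFIX LAM_MPI_SESSION_SUFFIX
}

# Hypothetical use at the start of the batch job:
#   set_lam_session "$HOME/lam-sessions" "job-$$"
#   lamboot
```

TMPDIR can then be removed right after lamhalt, because the LAM session
files no longer live inside it.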

However, it is elegant to use the unique TMPDIR to trigger a unique LAM
universe, and this should work. If it can't be fixed for some reason,
the doc should be updated accordingly.

All the best for LAM and Open-MPI.
Pierre.

-- 
Support the SAUVONS LA RECHERCHE movement:
http://recherche-en-danger.apinc.org/
       _/_/_/_/    _/       _/       Dr. Pierre VALIRON
      _/     _/   _/      _/   Laboratoire d'Astrophysique
     _/     _/   _/     _/    Observatoire de Grenoble / UJF
    _/_/_/_/    _/    _/    BP 53  F-38041 Grenoble Cedex 9 (France)
   _/          _/   _/      http://www-laog.obs.ujf-grenoble.fr
  _/          _/  _/        mail: Pierre.Valiron_at_[hidden]
 _/          _/ _/      Phone: +33 4 7651 4787  Fax: +33 4 7644 8821
_/          _/_/