LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-06-11 07:14:56


On Jun 5, 2005, at 1:18 PM, Pierre Valiron wrote:

> I have the following sequence in my batch jobs:
>
> TMPDIR="/some/unique/path"
> export TMPDIR
> mkdir -p $TMPDIR
> lamboot
>
> cd $TMPDIR
> mpirun some_work
> cp results $HOME
>
> lamhalt
> rm -rf $TMPDIR
>
> The bug is that the lam daemon is not properly halted. After running a
> while I have discovered *thousands* of pending lam daemons on the
> machine...

Yoinks!

> I suppose lamhalt posts an asynchronous request to the daemon, and if
> the TMPDIR is deleted too quickly the daemon is prevented from halting.
> If I add a sleep between lamhalt and rm, the daemon is generally
> properly halted.

This is exactly what is happening. Generally, when you lamhalt, it
takes 1-4 seconds for the LAM universe to finish coming down *after*
lamhalt returns. Some relatively uninteresting daemon ordering issues
are the cause of this -- search this list's archives for discussions
about it, if you care.
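
As a rough sketch of how a batch script could wait out that delay instead of
using a fixed sleep: the helper below polls a process ID until it disappears
or a timeout expires. The function name wait_for_exit and the variable
LAMD_PID are made up for illustration; LAM does not provide them, and the
script has to find the daemon's PID itself (how to do that is not covered
here).

```shell
#!/bin/sh
# wait_for_exit PID TIMEOUT_SECONDS   (hypothetical helper, not part of LAM)
# Returns 0 once PID no longer exists, 1 if it is still alive after
# roughly TIMEOUT_SECONDS. Uses "kill -0", which sends no signal and only
# checks that the process exists and that we may signal it.
wait_for_exit() {
    pid=$1
    remaining=$2
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Intended use at the end of the batch job (LAMD_PID is an assumption: the
# PID of this job's lam daemon, recorded by the script after lamboot):
#
#   lamhalt
#   wait_for_exit "$LAMD_PID" 10 && rm -rf "$TMPDIR"
```

With this, the temporary directory is removed only after the daemon has
actually gone away; if the timeout expires, the directory is left in place
so the daemon can still finish shutting down.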

> Of course there is a workaround using
>
> export LAM_MPI_SESSION_PREFIX="/some/permanent/path"
> export LAM_MPI_SESSION_SUFFIX="some_unique_name"

I'm a little confused -- this should suffer exactly the same problem as
you described above. The mechanism for where LAM's session directory
is located/found does not affect the takedown time of lamhalt.

> However it is elegant to use the unique TMPDIR to trigger a unique LAM
> universe... and this should work, or if it can't be fixed for some
> reason the doc should be updated accordingly.

You're probably right; this has bitten enough people that we should do
something about it.

However, I literally just noticed that the LAM tarballs do not include
the lamhalt man page (it exists -- I swear it!). !@#$@!$!!

I'll go fix that for 7.1.2...

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/