On Jun 5, 2005, at 1:18 PM, Pierre Valiron wrote:
> I have the following sequence in my batch jobs:
>
> TMPDIR="/some/unique/path"
> export TMPDIR
> mkdir -p $TMPDIR
> lamboot
>
> cd $TMPDIR
> mpirun some_work
> cp results $HOME
>
> lamhalt
> rm -rf $TMPDIR
>
> The bug is that the lam daemon is not properly halted. After running for
> a while I discovered *thousands* of leftover lam daemons on the
> machine...
Yoinks!
> I suppose lamhalt posts some asynchronous request to the daemon, and if
> the TMPDIR is deleted too quickly the daemon is prevented from halting.
> If I add a sleep between lamhalt and rm, the daemon is generally
> properly halted.
This is exactly what is happening. Generally, when you lamhalt, it
takes 1-4 seconds for the LAM universe to finish coming down *after*
lamhalt returns. Some relatively uninteresting daemon ordering issues
are the cause of this -- search this list's archives for discussions
about it, if you care.
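Rather than guessing at a sleep length, the job script could poll until
the daemons are actually gone before removing $TMPDIR. What follows is
only a rough sketch, not something LAM provides: the helper name
wait_for_exit, the 30-second timeout, and the pgrep lookup are all
assumptions for illustration, and the lookup assumes that the only lamd
processes owned by your user belong to this job.

```shell
#!/bin/sh
# Sketch only: wait_for_exit is a hypothetical helper, not part of LAM.
# Usage: wait_for_exit TIMEOUT_SECONDS PID...
# Returns 0 once every PID has exited, 1 if the timeout expires first.
wait_for_exit() {
    timeout=$1
    shift
    elapsed=0
    for pid in "$@"; do
        # kill -0 delivers no signal; it only tests whether the PID exists.
        while kill -0 "$pid" 2>/dev/null; do
            if [ "$elapsed" -ge "$timeout" ]; then
                return 1
            fi
            sleep 1
            elapsed=$((elapsed + 1))
        done
    done
    return 0
}

# Intended use in the batch job (left commented out; assumes every lamd
# owned by this user belongs to this job):
#   lamd_pids=$(pgrep -u "$(id -u)" -x lamd)
#   lamhalt
#   wait_for_exit 30 $lamd_pids && rm -rf "$TMPDIR"
```

If the timeout expires, the rm is skipped and $TMPDIR is left in place,
which at least makes the stuck daemon easy to spot afterwards.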
> Of course there is a workaround using
>
> export LAM_MPI_SESSION_PREFIX="/some/permanent/path"
> export LAM_MPI_SESSION_SUFFIX="some_unique_name"
I'm a little confused -- this should suffer exactly the same problem as
you described above. The mechanism LAM uses to locate its session
directory does not affect how long the takedown takes after lamhalt.
> However it is elegant to use the unique TMPDIR to trigger a unique LAM
> universe... and this should work, or if it can't be fixed for some
> reason the doc should be updated accordingly.
You're probably right; this has bitten enough people that we should do
something about it.
However, I literally just noticed that the LAM tarballs do not include
the lamhalt man page (it exists -- I swear it!). !@#$@!$!!
I'll go fix that for 7.1.2...
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/