LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2007-07-17 22:59:19


On Jul 17, 2007, at 7:36 AM, Vechinski, Douglas A. wrote:

> I’m using LAM to run a large number of parallel jobs on a cluster
> using the PBS batch system. Each script that is submitted performs
> a lamboot command. It is very likely that more than one job may be
> started on a single node. I have been setting
> LAM_MPI_SESSION_SUFFIX to a job ID number so that there is a unique
> LAM daemon associated with each batch job that is submitted. My
> particular question is what happens when I execute the lamhalt
> command. Since there is no option to specify the appropriate
> suffix for which LAM daemon to close, what happens? Will lamhalt
> only halt the LAM daemon that was used for that particular job? If
> so, how does it know which one to kill? I don’t want lamhalt to
> halt the daemon for a different job that happens to be running on
> the same node.
>

A given LAM daemon only knows the contact information for other
daemons in its "universe", which is the group of daemons started
by a single 'lamboot' command. Other daemons started with other calls
to 'lamboot' are in a different universe and are not in communication
with the first universe.

lamhalt just asks its local LAM daemon for the list of all the
daemons in the universe, then sends each of them a die message. The
universe to get the information from is determined by the session
directory (so usually by LAM_MPI_SESSION_SUFFIX).
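
As a rough illustration of that lookup, here is a toy Python model. It is
not LAM's code: the names toy_lamboot, toy_lamhalt and sessions are made up
for this sketch, and a plain dictionary stands in for the session
directory. It only shows why halting one universe leaves the others alone,
even when they share a node.

```python
# Toy model only. Real LAM daemons are separate processes located through
# the session directory; here a dict keyed by session suffix plays that role.
sessions = {}  # hypothetical: session suffix -> set of nodes with a live daemon


def toy_lamboot(suffix, nodes):
    """Start one 'universe' of daemons, registered under the given suffix."""
    sessions[suffix] = set(nodes)


def toy_lamhalt(suffix):
    """Halt only the daemons in the universe found via the given suffix."""
    universe = sessions.pop(suffix)  # look up the universe by its session
    return sorted(universe)          # these daemons receive the die message


# Two batch jobs that overlap on node2, each with its own suffix.
toy_lamboot("job101", ["node1", "node2"])
toy_lamboot("job102", ["node2", "node3"])

halted = toy_lamboot_result = toy_lamhalt("job101")
print("halted:", halted)
print("still running for job102:", sorted(sessions["job102"]))
```

The second universe's daemon on node2 is untouched because it was never
registered under the first job's suffix.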

So as long as the same value for LAM_MPI_SESSION_SUFFIX that was
there when lamboot was run is there when lamhalt is run, only those
daemons started by the given lamboot will be killed. All others will
continue on without disruption.
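
One way to make sure the suffix cannot change between the two commands is
to run both from the same wrapper with the same environment. The sketch
below is a hypothetical helper (run_lam_job is not part of LAM), and it
assumes that lamboot, mpirun and lamhalt are on the PATH and that
"mpirun C" is the launch form you want. The run parameter exists only so
the command sequence can be checked without a LAM installation.

```python
import os
import subprocess


def run_lam_job(job_id, boot_schema, program, run=subprocess.run):
    """Boot, run and halt one LAM universe under a single session suffix.

    Hypothetical helper: job_id would typically come from PBS_JOBID and
    boot_schema from PBS_NODEFILE inside a batch script.
    """
    # Copy the current environment and pin the suffix for every command.
    env = dict(os.environ, LAM_MPI_SESSION_SUFFIX=str(job_id))

    run(["lamboot", boot_schema], env=env, check=True)
    try:
        # Assumption: one process per CPU via LAM's "C" notation.
        run(["mpirun", "C", program], env=env, check=True)
    finally:
        # Same env as lamboot, so only this job's daemons are halted,
        # even if the MPI program itself failed.
        run(["lamhalt"], env=env, check=True)
```

Inside a PBS job this might be called as
run_lam_job(os.environ["PBS_JOBID"], os.environ["PBS_NODEFILE"], "./my_program"),
where ./my_program is a placeholder for your own executable.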

Brian

-- 
   Brian Barrett
   LAM/MPI Developer
   Make today a LAM/MPI day!