On Wed, 18 May 2005, Travis Spencer wrote:
> As students learn how LAM works, they often start a cluster of nodes
> and leave them running without properly cleaning them up
The usual answer to this problem is: use a batch/queueing system which
is able to properly clean up after each job. LAM's integration with
PBS/Torque using the tm boot module allows this; for SGE, there is a
recent document that describes the tight integration at:
http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
> Is there a way that we (the system administrators) can determine
> that a process started by LAM is a runaway (i.e., an abandoned
> program)?
Deciding whether a process is a runaway is difficult, and the criteria
are often site-specific or even program-specific. So there is no
generic answer...
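
To illustrate what a site-specific rule could look like, here is a
minimal sketch, not a recommendation. It assumes that the LAM daemon
shows up in the process table under the command name "lamd", and that
your site is willing to treat any lamd older than some threshold (24
hours here, an arbitrary value) as suspicious. The function names are
made up for this example, and the age threshold says nothing about
whether the user is still actively working.

```python
import subprocess


def etime_to_seconds(etime):
    """Convert a ps 'etime' value of the form [[dd-]hh:]mm:ss to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    # Pad missing leading fields (hours) with zero.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def stale_lamd(ps_output, max_age_seconds):
    """Return (pid, user, age_seconds) for each lamd older than the threshold.

    ps_output is expected to hold one process per line with the columns
    pid, user, etime, comm (no header line).
    """
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        pid, user, etime, comm = fields
        if comm.strip() != "lamd":  # assumption: daemon is named "lamd"
            continue
        age = etime_to_seconds(etime)
        if age > max_age_seconds:
            stale.append((int(pid), user, age))
    return stale


def scan_local_host(max_age_seconds=24 * 3600):
    """Run ps on this host and report lamd processes above the threshold."""
    out = subprocess.run(
        ["ps", "-eo", "pid=,user=,etime=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return stale_lamd(out, max_age_seconds)
```

Calling scan_local_host() on each node only produces a list of
candidates. Whether to notify the user or to clean up on their behalf
remains a policy decision for the site.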
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_[hidden]