LAM/MPI General User's Mailing List Archives

From: Bogdan Costescu (Bogdan.Costescu_at_[hidden])
Date: 2005-05-19 08:06:03


On Wed, 18 May 2005, Travis Spencer wrote:

> As students learn how LAM works, they often start a cluster of nodes
> and leave them running without properly cleaning them up

The usual answer to this problem is: use a batch/queueing system which
is able to properly clean up after each job. LAM's integration with
PBS/Torque using the tm boot module allows this; for SGE, there is a
recent document that describes the tight integration at:

http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html

> Is there a way that we (the system administrators) can determine
> that a process started by LAM is a runaway (i.e., an abandoned
> program)?

Deciding whether a process is a runaway is difficult, and the criteria
are often site-specific or even program-specific. So there is no
generic answer...
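
To illustrate what a site-specific heuristic might look like, here is a
minimal sketch in Python. It assumes the site policy is simply "a lamd
daemon older than some age limit is worth a look", and that the input is
text in the shape produced by something like `ps -eo pid,user,etime,comm`
on Linux. The function names, the sample data and the 24-hour limit are
all hypothetical, and an old lamd is only a candidate, not proof of an
abandoned session.

```python
def parse_etime(etime):
    """Convert a ps elapsed-time string ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:          # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_old_lamd(ps_output, max_age_seconds):
    """Return (pid, user, age_seconds) for lamd processes above the age limit.

    Assumed policy (hypothetical): age alone marks a candidate runaway.
    """
    candidates = []
    for line in ps_output.splitlines()[1:]:   # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue
        pid, user, etime, comm = fields[:4]
        if comm != "lamd":
            continue
        age = parse_etime(etime)
        if age > max_age_seconds:
            candidates.append((int(pid), user, age))
    return candidates


# Made-up sample data, only to show the expected input shape.
SAMPLE = """\
  PID USER         ELAPSED COMMAND
 1201 alice     2-03:15:42 lamd
 1544 bob            12:07 lamd
 1602 bob            11:58 a.out
"""

if __name__ == "__main__":
    for pid, user, age in find_old_lamd(SAMPLE, 24 * 3600):
        print(f"candidate runaway: pid={pid} user={user} age={age}s")
```

A real policy would probably also want to check whether the owner is
still logged in or whether any MPI processes are still attached to the
daemon, which is exactly the part that differs from site to site.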

-- 
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_[hidden]