On Jul 29, 2009, at 3:12 AM, Shaun G. wrote:
> Each application run is restricted to a prescribed wall-clock time
> limit. If a run hasn't finished within the limit it should be
> terminated so the next run can start `clean'. Also, the executable
> application may hang or crash for some inputs.
>
> How can I check:
> a) if the application has crashed
> b) if the application hasn't completed running within the time limit
> c) if a) or b) are true, kill the MPI processes on all the nodes
> started by this application (the current run).
It sounds like you might be well served by a resource manager, such
as Torque or SLURM (or any of a variety of others). RMs will queue
up jobs, run them when resources are available, clean them up when
done, enforce wall-clock time limits, etc. They might do all the heavy
lifting in a robust manner so that you don't have to implement such
stuff yourself.
> The manual mentions `lamclean', but does it kill only my processes
> or other users' processes as well?
It will only kill your processes.
> Also, is it enough to check the runtime of the `mpirun' script? If I
> kill `mpirun' after the time limit has been reached will this kill
> the MPI processes on the remote nodes?
Checking the wall-clock time of mpirun is fine. CPU time is not a
useful measure, though -- mpirun blocks most of the time, so it
accumulates very little CPU time.
> Lastly, the `search' function on the mailing archive reports `
> Sorry, you didn't specify any search criteria, so no search was
> performed. ' whatever text I typed in the search box.
Oops. Will have to check with the webmaster to see what went wrong
there... Thanks for the heads up.
--
Jeff Squyres
jsquyres_at_[hidden]