Dear All,
I have a legacy executable that was written to run with MPI. I need to run it repeatedly, with a different input file for each run. The cluster uses P6 machines. No MPI processes are started on the master.
Each application run is restricted to a prescribed wall-clock time limit. If a run hasn't finished within the limit it should be terminated so the next run can start `clean'. Also, the executable application may hang or crash for some inputs.
How can I:
a) check whether the application has crashed
b) check whether the application has failed to complete within the time limit
c) if a) or b) is true, kill the MPI processes that this application (the current run) started on all the nodes
The manual mentions `lamclean', but does it kill only my processes, or other users' processes as well?
Also, is it enough to check the runtime of the `mpirun' script? If I kill `mpirun' after the time limit has been reached, will this also kill the MPI processes on the remote nodes?
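For concreteness, here is a minimal Python sketch of the kind of wrapper in question. It only watches the local `mpirun' process, and it rests on several assumptions: that the exit status of `mpirun' is non-zero when the application crashes, that a POSIX system is used, and that an optional clean-up command (for example `lamclean') is run after a failed run. Whether that is sufficient to remove the remote processes is exactly what is being asked. The name run_with_limit and the paths in the usage comment are hypothetical.

```python
import os
import signal
import subprocess


def run_with_limit(cmd, limit_s, cleanup_cmd=None):
    """Run cmd and return 'ok', 'crashed' or 'timeout'.

    Assumption: a non-zero exit status (or death by signal) of the
    launched command means the application crashed.
    """
    # Start the job in its own session so that the whole local process
    # group can be signalled at once (POSIX only).
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        rc = proc.wait(timeout=limit_s)
        status = "ok" if rc == 0 else "crashed"
    except subprocess.TimeoutExpired:
        # Wall-clock limit reached: ask politely first, then force.
        os.killpg(proc.pid, signal.SIGTERM)
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()
        status = "timeout"
    # Optional clean-up after a failed run; this only affects the local
    # process group unless cleanup_cmd itself reaches the remote nodes.
    if status != "ok" and cleanup_cmd is not None:
        subprocess.run(cleanup_cmd, check=False)
    return status


# Hypothetical usage (executable and input file names are placeholders):
# run_with_limit(["mpirun", "-np", "4", "./legacy_app", "input1.dat"],
#                3600, cleanup_cmd=["lamclean"])
```

A hang and a run that is merely slow look the same to this wrapper: both are reported as 'timeout' once the limit expires.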
Lastly, the `search' function on the mailing list archive reports `Sorry, you didn't specify any search criteria, so no search was performed.' whatever text I type in the search box.
Cheers and thanks,
Shaun.