Hi,
I am trying to understand how LAM/MPI with BLCR works with Torque + Maui.
I want to submit an MPI job through Maui+Torque, checkpoint it, kill it
and restart it later. This is what I am doing.
I have set up two nearly identical FC4 machines with Torque+Maui and
LAM/MPI combined with BLCR.
1. I am using a submission script similar to the following:
#PBS -l nodes=2:ppn=1
lamboot $PBS_NODEFILE
mpirun -ssi rpi crtcp -np 2 hello
lamhalt
2. I run lamcheckpoint on the node where mpirun runs and I get a single
context file in my home directory. According to the documentation,
context files should be generated for each MPI job.
3. Now, the tasks just continue. How do I induce faults here? If I kill
any of the tasks (the hello programs in this case), lamd thinks that
the job is done and exits, cleaning up the context file as well.
My question is, how do I induce faults? Should I kill lamd? Is there
any way I can kill the process with lamcheckpoint, something like
lamcheckpoint --kill <mpirun pid>?
4. How is lamrestart supposed to be used? When I run lamrestart on
the context file generated, nothing happens. No new jobs in Torque,
no new processes on the machine. Can lamrestart submit the
(restarted) job back to Torque?
5. If I try cr_restart on the context file, I get a segmentation fault.
I have checked the mailing list archives, but didn't find anything
directly related to this.
Any help is greatly appreciated.
Thanks,
--pradeep
P.S. I am using a patched version of blcr to make it work on FC4. The
patch was given to me by Paul Hargrove.
P.S.2 I sent a similar mail to Paul and he asked me to post it here.