Thanks for the reply. My comments below.
> Correct.
>
> You might want to try this interactively (e.g., use "qsub -I ...") and
> see if you can get the same results.
>
> Also double check that your LD_LIBRARY_PATH set properly on all nodes
> such that the BLCR library can be found.
Yes, my ld.so.conf contains the right paths for the BLCR library.
>>3. Now, the tasks just continue. How do I induce faults here? If I kill
>> any of the tasks (the hello programs in this case), lamd thinks that
>> the job is done and exits, cleaning up the context file as well.
>
>
> None of the LAM infrastructure should ever remove checkpoint files,
> except when replacing them with new ones (e.g., when you checkpoint a
> parallel job a second time).
There seems to be a problem with BLCR, as I get only one context file.
Does LAM/MPI depend on a particular version of BLCR?
>>My question is, how do I induce faults? Should I kill lamd? Is there
>>any way I can kill the process with lamcheckpoint, something like
>>lamcheckpoint --kill <mpirun pid>.
>
>
> LAM doesn't currently support such an option. Another way of doing it,
> however, would be to run lamcheckpoint and then lamclean.
>
>>4. How's the lamrestart supposed to be used? When I run lamrestart on
>> the context file generated, nothing happens. No new jobs in Torque,
>> no new processes on the machine. Can lamrestart submit the
>> (restarted) job back to Torque?
>
>
> No. lamrestart will only restart the LAM job. Hence, you must
> manually obtain a new Torque job and setup a new LAM universe (or
> re-use the old one) and then you can lamrestart the old job.
Let me get this straight: I lamcheckpoint the job, collect the context
files, run lamclean, and then submit a new Torque job that runs
lamrestart on those context files. Am I right?
>>5. If I try cr_restart on the context file, I get a seg fault.
>
>
> I suspect that the problem is that you're not getting context files for
> all the parallel processes, and this leads to badness during the
> restart process. However, this could be due to BLCR itself failing.
>
> Are you able to checkpoint / restart serial processes?
Yes. In fact, I have integrated Torque+Maui with BLCR, as I am working
on some fault tolerance research. It would be great if I could checkpoint
LAM/MPI jobs as well.
--
Pradeep Padala
http://ppadala.blogspot.com