LAM/MPI General User's Mailing List Archives

From: Pradeep Padala (ppadala_at_[hidden])
Date: 2005-07-19 10:38:52


Thanks for the reply. My comments below.

> Correct.
>
> You might want to try this interactively (e.g., use "qsub -I ...") and
> see if you can get the same results.
>
> Also double check that your LD_LIBRARY_PATH is set properly on all nodes
> such that the BLCR library can be found.

Yes, my ld.so.conf contains the right paths for the BLCR library.

>>3. Now, the tasks just continue. How do I induce faults here? If I kill
>>   any of the tasks (the hello programs in this case), lamd thinks that
>>   the job is done and exits, cleaning up the context file as well.
>
>
> None of the LAM infrastructure should ever remove checkpoint files,
> except when replacing them with new ones (e.g., when you checkpoint a
> parallel job a second time).

There seems to be a problem with BLCR, as I get only one context file.
Does LAM/MPI depend on a particular version of BLCR?

>>My question is, how do I induce faults? Should I kill lamd? Is there
>>any way I can kill the process with lamcheckpoint, something like
>>lamcheckpoint --kill <mpirun pid>?
>
>
> LAM doesn't currently support such an option. Another way of doing it,
> however, would be to run lamcheckpoint and then lamclean.
>
>>4. How is lamrestart supposed to be used? When I run lamrestart on
>> the context file generated, nothing happens. No new jobs in Torque,
>> no new processes on the machine. Can lamrestart submit the
>> (restarted) job back to Torque?
>
>
> No. lamrestart will only restart the LAM job. Hence, you must
> manually obtain a new Torque job and set up a new LAM universe (or
> re-use the old one) and then you can lamrestart the old job.

Let me get this straight. I lamcheckpoint the job, collect the context
files, run lamclean and run a new job that contains a lamrestart. Am I
right?
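
To make sure I have the order right, here is a rough sketch of the sequence as I understand it. It only prints the commands rather than running them. The "-ssi cr blcr", "-pid" and "-ssi cr_blcr_context_file" arguments are my assumption from reading the LAM documentation and are not confirmed in this thread; the PID and context file name are placeholders, and print_cr_cycle is just a hypothetical helper name.

```shell
#!/bin/sh
# Dry-run sketch of the checkpoint -> clean -> restart cycle.
# ASSUMPTION: the lamcheckpoint/lamrestart arguments below follow my reading
# of the LAM documentation; please correct them if they are wrong.

# print_cr_cycle (hypothetical helper): print the commands in order so the
# plan can be reviewed before anything is actually executed.
print_cr_cycle() {
    mpirun_pid=$1      # PID of the running mpirun (placeholder value)
    context_file=$2    # context file written for mpirun (placeholder name)

    # Step 1: checkpoint the running parallel job.
    echo "lamcheckpoint -ssi cr blcr -pid $mpirun_pid"
    # Step 2: kill the job's processes to simulate the fault.
    echo "lamclean"
    # Step 3: run inside a new Torque job with a LAM universe available.
    echo "lamrestart -ssi cr blcr -ssi cr_blcr_context_file $context_file"
}

print_cr_cycle "${MPIRUN_PID:-12345}" "${CONTEXT_FILE:-context.mpirun.12345}"
```

The third command would go into the script of the new Torque job, after a LAM universe has been booted there (or the old one re-used).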

>>5. If I try cr_restart on the context file, I get a seg fault.
>
>
> I suspect that the problem is that you're not getting context files for
> all the parallel processes, and this leads to badness during the
> restart process. However, this could be due to BLCR itself failing.
>
> Are you able to checkpoint / restart serial processes?

Yes. Actually, I have integrated Torque+Maui with BLCR, as I am working
on some fault tolerance research. It would be great if I could checkpoint
LAM/MPI jobs as well.

-- 
Pradeep Padala
http://ppadala.blogspot.com