For the benefit of the web archives and anyone else who is following
this thread, it's a sticky issue that we're working on off-list.
We'll post the final resolution back here.
(ping me off-list if you want to be involved in the conversation)
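
For anyone reading this in the archives, the sequence reported in the
quoted message below (steps 2 through 4) can be sketched as a small
script. This is only a sketch, not a fix: the commands and the
context.<pid> file name are taken verbatim from the report, while the
function names and the placeholder PID are made up for illustration. It
defaults to a dry run that just prints the commands.

```python
import shlex
import subprocess


def repro_commands(mpirun_pid):
    """Return the BLCR commands from steps 2-4 of the quoted report, in order."""
    pid = str(mpirun_pid)
    return [
        # Step 2: checkpoint the running job and terminate it.
        ["cr_checkpoint", "--term", pid],
        # Step 3: restart from the context file named in the report.
        ["cr_restart", "context." + pid],
        # Step 4: checkpoint again; this is the call reported to fail.
        ["cr_checkpoint", pid],
    ]


def run_repro(mpirun_pid, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in repro_commands(mpirun_pid):
        print("$ " + " ".join(shlex.quote(a) for a in argv))
        if not dry_run:
            # check=False: keep going so the output of a failing step is visible.
            subprocess.run(argv, check=False)


if __name__ == "__main__":
    # 12345 is a placeholder (assumption); substitute the real PID of mpirun.
    run_repro(12345)
```

Calling run_repro(pid, dry_run=False) on a machine with BLCR installed
would actually issue the commands, so only do that against a job you can
afford to lose.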
On Jan 11, 2006, at 1:24 PM, Yuan Tang wrote:
> Josh,
>
> I'm using blcr-0.4.2 and LAM/MPI-7.1.2b30. Could you help look into the problem?
>
> Thanks!
>
> Yuan
>
> Josh Hursey wrote:
>
>> Yuan,
>>
>> Sorry I'm coming into this conversation late, so I might have missed
>> some details.
>>
>> What version of BLCR and LAM/MPI are you using on these machines?
>>
>> Thanks,
>> Josh
>>
>> On Jan 11, 2006, at 11:10 AM, Yuan Tang wrote:
>>
>>
>>
>>> From: Yuan Tang <yuantang_at_[hidden]>
>>> Date: January 11, 2006 10:52:28 AM EST
>>> To: lam-devel_at_[hidden]
>>> Subject: Re: lam-devel Digest, Vol 143, Issue 1
>>>
>>>
>>> I mean that the restarted LAM/MPI processes cannot seem to be
>>> checkpointed again. That is:
>>>
>>> 1. In the beginning, I start the LAM/MPI program with 4 processes,
>>> for example.
>>>
>>> 2. call cr_checkpoint --term ${pid_mpirun}
>>>
>>> 3. call cr_restart context.${pid_mpirun}; it can restart the
>>> LAM/MPI program as long as the lamd has not exited.
>>>
>>> 4. But now, if I invoke cr_checkpoint ${pid_mpirun} again, the
>>> restarted program cannot be checkpointed anymore. It always reports:
>>> ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP): No such process (in
>>> the window where cr_checkpoint was invoked)
>>>
>>> &
>>>
>>> One of the processes started by mpirun has exited with a nonzero exit
>>> code. This typically indicates that the process finished in error.
>>> If your process did not finish in error, be sure to include a "return
>>> 0" or "exit(0)" in your C code before exiting the application.
>>>
>>> PID 25298 failed on node n0 (160.36.57.50) due to signal 11.
>>> -----------------------------------------------------------------------------
>>> rpwait failed: Success
>>> n0<25119> ssi:crlam:unlinking
>>> 0:/home/yuantang/context.25119-n0-25120.phase1
>>> n0<25119> ssi:crlam:unlinking
>>> 0:/home/yuantang/context.25119-n0-25121.phase1
>>> n0<25119> ssi:crlam:unlinking
>>> 0:/home/yuantang/context.25119-n0-25122.phase1
>>> n0<25119> ssi:crlam:unlinking
>>> 0:/home/yuantang/context.25119-n0-25123.phase1
>>>
>>> (in the window of the restarted program).
>>>
>>> So the restarted LAM/MPI processes cannot be checkpointed,
>>> is that right?
>>> Thanks!
>>>
>>> Yuan
>>>
>>> lam-devel-request_at_[hidden] wrote:
>>>
>>>
>>>
>>>>> 2. Even if the lamd does not exit, if I invoke "cr_checkpoint --term
>>>>> ${pid_mpirun}" multiple times, "cr_restart" always restarts
>>>>> the program from the first (earliest) checkpoint, which means the
>>>>> subsequent checkpoints have no effect. In fact, if I delete
>>>>> context.${pid_mpirun} while the application is running, I find that
>>>>> subsequent "cr_checkpoint --term ${pid_mpirun}" calls do not generate
>>>>> any checkpoint file. Why?
>>>>>
>>>>>
>>>>>
>>>> If you run cr_checkpoint with the "--term" option, a "SIGTERM"
>>>> signal is sent to all processes/threads in the LAM universe, and
>>>> execution aborts. Your subsequent "cr_checkpoint --term
>>>> ${pid_mpirun}" can't find the corresponding process to checkpoint,
>>>> but it should print a message like "No such process".
>>>>
>>>>
>>>>> Normally, re-invoking "cr_checkpoint ${pid_mpirun}" causes a
>>>>> signal 11 (SIGSEGV).
>>>>>
>>>>>
>>>>>
>>>> I have not run into this problem ;)
>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>>
>>>
>> ----
>> Josh Hursey
>> jjhursey_at_[hidden]
>> http://www.lam-mpi.org/
>>
>
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/