
LAM/MPI General User's Mailing List Archives


From: Josh Hursey (jjhursey_at_[hidden])
Date: 2006-01-11 11:38:06


Yuan,

Sorry I'm coming into this conversation late, so I might have missed
some details.

What version of BLCR and LAM/MPI are you using on these machines?

Thanks,
Josh

On Jan 11, 2006, at 11:10 AM, Yuan Tang wrote:

>
>
> From: Yuan Tang <yuantang_at_[hidden]>
> Date: January 11, 2006 10:52:28 AM EST
> To: lam-devel_at_[hidden]
> Subject: Re: lam-devel Digest, Vol 143, Issue 1
>
>
> I mean that the restarted LAM/MPI processes do not seem to be
> checkpointable again. That is:
>
> 1. At the beginning, I start the LAM/MPI program with, for example,
> 4 processes.
>
> 2. Call cr_checkpoint --term ${pid_mpirun}
>
> 3. Call cr_restart context.${pid_mpirun}; it can restart the
> LAM/MPI program as long as the lamd does not exit.
>
> 4. But now, if I invoke cr_checkpoint ${pid_mpirun} again, the
> restarted program cannot be checkpointed anymore; it always reports:
> ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP): No such process (in
> the window of invoking cr_checkpoint)
>
> and
>
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 25298 failed on node n0 (160.36.57.50) due to signal 11.
> -----------------------------------------------------------------------------
> rpwait failed: Success
> n0<25119> ssi:crlam:unlinking 0:/home/yuantang/context.25119-n0-25120.phase1
> n0<25119> ssi:crlam:unlinking 0:/home/yuantang/context.25119-n0-25121.phase1
> n0<25119> ssi:crlam:unlinking 0:/home/yuantang/context.25119-n0-25122.phase1
> n0<25119> ssi:crlam:unlinking 0:/home/yuantang/context.25119-n0-25123.phase1
>
> (in the window of the restarted program).
>
> So the restarted LAM/MPI processes cannot be checkpointed, is that right?
> Thanks!
>
> Yuan
>
> lam-devel-request_at_[hidden] wrote:
>
>>> 2. Even if the lamd does not exit, if I invoke "cr_checkpoint --term
>>> ${pid_mpirun}" multiple times, "cr_restart" will always restart
>>> the program from the first (earliest) checkpoint, which means the
>>> subsequent checkpoints don't take effect. In fact, if I delete
>>> context.${pid_mpirun} while the application is running, I find that
>>> the subsequent cr_checkpoint --term ${pid_mpirun} does not generate
>>> any checkpoint file anymore. Why?
>>>
>> If you pass the "--term" option to cr_checkpoint, a "SIGTERM"
>> signal will be sent to all processes/threads in the LAM universe, and
>> the execution will abort. Your subsequent "cr_checkpoint --term
>> ${pid_mpirun}" can't find the corresponding process to checkpoint,
>> but it should print a message like "No such process".
>>> Normally, re-invoking "cr_checkpoint ${pid_mpirun}" causes a
>>> signal 11 (SIGSEGV)
>>>
>> I have not run into this problem ;)
>
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
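
For anyone trying to reproduce this, the sequence Yuan describes in
steps 2-4 above could be scripted roughly as follows. This is only a
sketch: the cr_checkpoint and cr_restart invocations are copied from
the quoted message, while the helper names (run, repro_steps) and the
DRY_RUN and RESTART_WAIT variables are made up for illustration. It
assumes mpirun is already running (step 1) and that the restarted job
keeps the same PID, as the quoted steps imply. By default it only
prints the commands.

```shell
#!/bin/sh
# Hypothetical reproduction helper, not part of LAM/MPI or BLCR.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"
# Arbitrary delay (seconds) to let the restarted job come up; an assumption.
RESTART_WAIT="${RESTART_WAIT:-5}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

repro_steps() {
    pid_mpirun="$1"

    # Step 2: checkpoint mpirun and terminate the job.
    run cr_checkpoint --term "$pid_mpirun"

    # Step 3: restart from the context file written in step 2.
    # The restart is put in the background so that step 4 can be issued
    # from the same script (Yuan used a second window instead).
    if [ "$DRY_RUN" = "1" ]; then
        echo "cr_restart context.$pid_mpirun &"
    else
        cr_restart "context.$pid_mpirun" &
        sleep "$RESTART_WAIT"
    fi

    # Step 4: checkpoint the restarted job; this is the call that is
    # reported to fail with "No such process".
    run cr_checkpoint "$pid_mpirun"
}

# 25119 is simply the mpirun PID that appears in the log above.
repro_steps "${1:-25119}"
```

Run without arguments, it prints the three commands for PID 25119;
pass the real mpirun PID and set DRY_RUN=0 to actually execute them.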

----
Josh Hursey
jjhursey_at_[hidden]
http://www.lam-mpi.org/