I mean that the restarted LAM/MPI processes do not seem to be
checkpointable again.
That is:
1. At the beginning, I start the LAM/MPI program with 4 processes, for
example.
2. Call cr_checkpoint --term ${pid_mpirun}.
3. Call cr_restart context.${pid_mpirun}. This restarts the LAM/MPI
program as long as the lamd does not exit.
4. But now, if I invoke cr_checkpoint ${pid_mpirun} again, the restarted
program cannot be checkpointed anymore. It always reports:
ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP): No such process (in the
window where cr_checkpoint was invoked)
and
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 25298 failed on node n0 (160.36.57.50) due to signal 11.
-----------------------------------------------------------------------------
rpwait failed: Success
n0<25119> ssi:crlam:unlinking 0:/home/yuantang/context.25119-n0-25120.phase1
n0<25119> ssi:crlam:unlinking 0:/home/yuantang/context.25119-n0-25121.phase1
n0<25119> ssi:crlam:unlinking 0:/home/yuantang/context.25119-n0-25122.phase1
n0<25119> ssi:crlam:unlinking 0:/home/yuantang/context.25119-n0-25123.phase1
(in the window of the restarted program).
So, the restarted LAM/MPI processes cannot be checkpointed again, can they?
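
For reference, steps 2 to 4 above can be written down as a small shell sketch. The cr_checkpoint and cr_restart invocations are the ones listed in the steps. The `run`, `run_bg`, and `reproduce` helpers, the `DRY_RUN` switch, the 5-second pause, and the `mpirun -np 4 ./app` line in the comment are assumptions added only for illustration. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the checkpoint / restart / re-checkpoint sequence.
# Step 1 is not run here: start the program yourself, for example
# "mpirun -np 4 ./app &" (./app is a placeholder), and note mpirun's PID.

DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = really run them

# Print a command and, unless in dry-run mode, run it in the foreground.
run() {
    printf '%s\n' "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Same as run, but start the command in the background.
run_bg() {
    printf '%s\n' "+ $* &"
    if [ "$DRY_RUN" = "0" ]; then "$@" & fi
}

reproduce() {
    pid_mpirun=$1
    # Step 2: checkpoint the job and terminate it.
    run cr_checkpoint --term "$pid_mpirun"
    # Step 3: restart from the context file (backgrounded, because the
    # restarted job would otherwise occupy this shell).
    run_bg cr_restart "context.$pid_mpirun"
    # Give the restarted job a moment to come up (arbitrary pause).
    run sleep 5
    # Step 4: try to checkpoint the restarted job; this is the call that fails.
    run cr_checkpoint "$pid_mpirun"
}

# Only act when a PID is given on the command line.
if [ -n "${1:-}" ]; then
    reproduce "$1"
fi
```

Saved under a name of your choice, the script is called with the PID of mpirun as its only argument. Setting DRY_RUN=0 in the environment makes it execute the commands instead of printing them.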
Thanks!
Yuan
lam-devel-request_at_[hidden] wrote:
>>2. Even if the lamd does not exit, if I invoke "cr_checkpoint --term
>>${pid_mpirun}" multiple times, "cr_restart" will always restart the
>>program from the first (earliest) checkpoint, which means the subsequent
>>checkpoints do not take any effect. Actually, if I delete
>>context.${pid_mpirun} while the application is running, I find that a
>>subsequent cr_checkpoint --term ${pid_mpirun} does not generate any
>>checkpoint file anymore. Why?
>>
>>
>If you give cr_checkpoint the "--term" option, a SIGTERM signal will be
>sent to all processes/threads in the LAM universe, and the execution will
>abort. Your subsequent "cr_checkpoint --term ${pid_mpirun}" can't find the
>corresponding process to checkpoint, but it should print a message like
>"No such process".
>
>
>>Normally, re-invoking "cr_checkpoint ${pid_mpirun}" will cause a
>>signal 11 (SIGSEGV).
>>
>>
>I have not run into this problem ;)
>
>
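
Regarding the quoted point that a second "cr_checkpoint --term" cannot find its process: one way to see this from the shell is to test whether the PID still exists before checkpointing. This is only a sketch. `is_alive` and `checkpoint_if_alive` are hypothetical helper names, and `kill -0` only reports whether the PID exists and may be signalled, not whether the process can be checkpointed.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if a process with the given PID
# exists and we are allowed to signal it. Signal 0 sends nothing.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

# Checkpoint and terminate the job only if its PID is still present.
checkpoint_if_alive() {
    pid=$1
    if is_alive "$pid"; then
        echo "checkpointing $pid"
        cr_checkpoint --term "$pid"
    else
        echo "no such process: $pid (already terminated by --term?)" >&2
        return 1
    fi
}
```

Calling checkpoint_if_alive with the PID of mpirun a second time, without a cr_restart in between, would be expected to take the "no such process" branch.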