Josh,
I am using blcr-0.4.2 and LAM/MPI-7.1.2b30. Could you help check the problem?
Thanks!
Yuan
Josh Hursey wrote:
>Yuan,
>
>Sorry I'm coming into this conversation late, so I might have missed
>some details.
>
>What version of BLCR and LAM/MPI are you using on these machines?
>
>Thanks,
>Josh
>
>On Jan 11, 2006, at 11:10 AM, Yuan Tang wrote:
>
>
>
>>From: Yuan Tang <yuantang_at_[hidden]>
>>Date: January 11, 2006 10:52:28 AM EST
>>To: lam-devel_at_[hidden]
>>Subject: Re: lam-devel Digest, Vol 143, Issue 1
>>
>>
>>I mean that the restarted LAM/MPI processes do not seem to be
>>checkpointable again. That is,
>>
>>1. In the beginning, I start the LAM/MPI program with 4 processes, for
>>example.
>>
>>2. I call cr_checkpoint --term ${pid_mpirun}
>>
>>3. I call cr_restart context.${pid_mpirun}, which restarts the
>>LAM/MPI program as long as the lamd does not exit.
>>
>>4. But now, if I invoke cr_checkpoint ${pid_mpirun} again, the
>>restarted program cannot be checkpointed anymore; it always reports:
>>ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP): No such process (in
>>the window of invoking cr_checkpoint)
>>
>>and
>>
>>One of the processes started by mpirun has exited with a nonzero exit
>>code. This typically indicates that the process finished in error.
>>If your process did not finish in error, be sure to include a "return
>>0" or "exit(0)" in your C code before exiting the application.
>>
>>PID 25298 failed on node n0 (160.36.57.50) due to signal 11.
>>-----------------------------------------------------------------------------
>>rpwait failed: Success
>>n0<25119> ssi:crlam:unlinking
>>0:/home/yuantang/context.25119-n0-25120.phase1
>>n0<25119> ssi:crlam:unlinking
>>0:/home/yuantang/context.25119-n0-25121.phase1
>>n0<25119> ssi:crlam:unlinking
>>0:/home/yuantang/context.25119-n0-25122.phase1
>>n0<25119> ssi:crlam:unlinking
>>0:/home/yuantang/context.25119-n0-25123.phase1
>>
>>(in the window of restarted program).
>>
>>So the restarted LAM/MPI processes cannot be checkpointed again, is
>>that right? Thanks!
>>
>>Yuan
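
For reference, steps 2-4 of the report above can be sketched as a small shell helper. It only prints the three commands in the order described and does not run BLCR itself. The function name print_repro_steps and the example PID 25119 are illustrative assumptions; the cr_checkpoint and cr_restart invocations are taken as written in the message.

```shell
#!/bin/sh
# Print the checkpoint / restart / re-checkpoint sequence from steps 2-4.
# print_repro_steps is a hypothetical helper, not part of BLCR or LAM/MPI.
print_repro_steps() {
    pid_mpirun=$1    # PID of the running mpirun (step 1 is assumed done)

    # Step 2: checkpoint mpirun and terminate the job.
    printf 'cr_checkpoint --term %s\n' "$pid_mpirun"

    # Step 3: restart from the context file written in step 2.
    printf 'cr_restart context.%s\n' "$pid_mpirun"

    # Step 4: checkpoint again; this is the call reported to fail.
    printf 'cr_checkpoint %s\n' "$pid_mpirun"
}

# Example PID only; substitute the real PID of mpirun.
print_repro_steps 25119
```

As in the report, the restart and the second checkpoint would be issued from separate windows, since the restarted program occupies the window it was started in.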
>>
>>lam-devel-request_at_[hidden] wrote:
>>
>>
>>
>>>>2. Even if the lamd does not exit, if I invoke "cr_checkpoint --term
>>>>${pid_mpirun}" multiple times, "cr_restart" always restarts
>>>>the program from the first (earliest) checkpoint, which means the
>>>>subsequent checkpoints do not take effect. In fact, if I delete
>>>>context.${pid_mpirun} while the application is running, the
>>>>subsequent cr_checkpoint --term ${pid_mpirun} does not generate any
>>>>checkpoint file anymore. Why?
>>>>
>>>>
>>>>
>>>If you pass the "--term" option to cr_checkpoint, a SIGTERM signal
>>>is sent to all processes/threads in the LAM universe, and
>>>execution aborts. Your subsequent "cr_checkpoint --term
>>>${pid_mpirun}" can't find the corresponding process to checkpoint,
>>>but it should print a message like "No such process".
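
The point above (after --term, the old PID no longer exists) can be checked before issuing another checkpoint. The following is a minimal sketch, assuming a POSIX shell; pid_alive and checkpoint_if_alive are hypothetical helper names, not part of BLCR or LAM/MPI, and the helper only prints the command it would run.

```shell
#!/bin/sh
# pid_alive: hypothetical helper. "kill -0" sends no signal; it only
# reports whether the given PID exists and may be signalled by this user.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}

# checkpoint_if_alive: print the checkpoint command only when the target
# process still exists; otherwise say why it is skipped.
checkpoint_if_alive() {
    if pid_alive "$1"; then
        echo "cr_checkpoint --term $1"
    else
        echo "skip: no such process $1"
    fi
}
```

For example, calling checkpoint_if_alive with the PID of an mpirun that was already terminated by an earlier "--term" checkpoint takes the "skip" branch instead of invoking cr_checkpoint on a process that is gone.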
>>>
>>>
>>>>Normally, re-invoking "cr_checkpoint ${pid_mpirun}" causes a
>>>>signal 11 (SIGSEGV).
>>>>
>>>>
>>>>
>>>I have not run into this problem ;)
>>>
>>>
>>
>>
>>_______________________________________________
>>This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>
>>
>----
>Josh Hursey
>jjhursey_at_[hidden]
>http://www.lam-mpi.org/
>