Normally, re-invoking "cr_checkpoint ${pid_mpirun}" causes a
signal 11 (SIGSEGV) and produces output like the following:
n0<2193> ssi:crmpi:mpi_lock: interrupting RPI
n0<2193> ssi:crmpi:mpi_lock: interrupting coll modules
n0<2193> ssi:crmpi:mpi_lock: trying to lock MPI mutex
n0<2193> ssi:crmpi:mpi_lock: lam_mpi_mutex held by app_thread; try again
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 2357 failed on node n0 (160.36.57.50) due to signal 11.
-----------------------------------------------------------------------------
rpwait failed: Success
n0<2192> ssi:crlam:unlinking 0:/home/yuantang/context.2192-n0-2193.phase1
n0<2192> ssi:crlam:unlinking 0:/home/yuantang/context.2192-n0-2194.phase1
n0<2192> ssi:crlam:unlinking 0:/home/yuantang/context.2192-n0-2195.phase1
n0<2192> ssi:crlam:unlinking 0:/home/yuantang/context.2192-n0-2196.phase1
Thanks!
Yuan
lam-devel-request_at_[hidden] wrote:
>Today's Topics:
>
> 1. Re: Problem regarding Checkpoint/restart issue in LAM
> (Liu Xuezhao)
>
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Sat, 7 Jan 2006 11:24:44 +0800
>From: "Liu Xuezhao" <lxz_at_[hidden]>
>Subject: Re: [lam-devel] Problem regarding Checkpoint/restart issue in
> LAM
>To: LAM/MPI development issues <lam-devel_at_[hidden]>
>Message-ID: <20060107032512.A1833FB045_at_[hidden]>
>Content-Type: text/plain; charset="GB2312"
>
>
> You are so kind and earnest. :)
> I think the 7.1.2 beta will work. I will test it in two days; if there are still problems with the 7.1.2 beta, I will report them to you.
> Thanks.
>
>======= 2006-01-06 11:45:00 Jeff Squyres wrote: =======
>
>
>
>>Yoinks; sorry, this is my fault for not noticing earlier. :-(
>>
>>I did my tests yesterday with the LAM subversion trunk -- *not*
>>7.1.1. Specifically, what you noted has been fixed on both the SVN
>>trunk and the 7.1.2 beta. Here's the note in the HISTORY file:
>>
>>- Fix a problem inadvertently caused by bug 682: instead of trying to
>>rectify crmpi modules that are sent by MPI processes to the spawning
>>agent, simply disallow MPI_COMM_SPAWN'ed processes from being
>>checkpointable.
>>
>>So that code now actually reads:
>>
>>  if (mpi_nparent == 0) {
>>    if (lam_ssi_crmpi_base_available != NULL) {
>>      module = (lam_ssi_module_t *)
>>        al_top(lam_ssi_crmpi_base_available);
>>    }
>>  } else {
>>    module = NULL;
>>  }
>>
>>I totally forgot that we had fixed this in the 7.1.2 beta; mea culpa
>>for not identifying this earlier. :-(
>>
>>Can you try the 7.1.2 beta and see if that works for you?
>>
>> http://www.lam-mpi.org/beta/
>>
>
>End of lam-devel Digest, Vol 141, Issue 1
>*****************************************
>
>