
LAM/MPI Development Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-01-14 08:36:13


Per your later mails, has this problem been fixed? It didn't seem
like you were running into segv problems anymore.

On Jan 9, 2006, at 11:36 AM, Yuan Tang wrote:

> Normally, re-invoking "cr_checkpoint ${pid_mpirun}" causes a
> signal 11 (SIGSEGV) and outputs something like the following:
>
> n0<2193> ssi:crmpi:mpi_lock: interrupting RPI
> n0<2193> ssi:crmpi:mpi_lock: interrupting coll modules
> n0<2193> ssi:crmpi:mpi_lock: trying to lock MPI mutex
> n0<2193> ssi:crmpi:mpi_lock: lam_mpi_mutex held by app_thread; try again
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 2357 failed on node n0 (160.36.57.50) due to signal 11.
> -----------------------------------------------------------------------------
> rpwait failed: Success
> n0<2192> ssi:crlam:unlinking 0:/home/yuantang/context.2192-n0-2193.phase1
> n0<2192> ssi:crlam:unlinking 0:/home/yuantang/context.2192-n0-2194.phase1
> n0<2192> ssi:crlam:unlinking 0:/home/yuantang/context.2192-n0-2195.phase1
> n0<2192> ssi:crlam:unlinking 0:/home/yuantang/context.2192-n0-2196.phase1
>
> Thanks!
>
> Yuan
>
> lam-devel-request_at_[hidden] wrote:
>
>> Message: 1
>> Date: Sat, 7 Jan 2006 11:24:44 +0800
>> From: "Liu Xuezhao" <lxz_at_[hidden]>
>> Subject: Re: [lam-devel] Problem regarding Checkpoint/restart issue in LAM
>> To: LAM/MPI development issues <lam-devel_at_[hidden]>
>> Message-ID: <20060107032512.A1833FB045_at_[hidden]>
>> Content-Type: text/plain; charset="GB2312"
>>
>>
>> You are so kind and earnest, :)
>> I think the 7.1.2 beta will work. I will test it in two days;
>> if there are still problems with the 7.1.2 beta, I will report
>> them to you.
>> Thanks.
>>
>> ======= 2006-01-06 11:45:00 Jeff Squyres wrote: =======
>>
>>
>>
>>> Yoinks; sorry, this is my fault for not noticing earlier. :-(
>>>
>>> I did my tests yesterday with the LAM subversion trunk -- *not*
>>> 7.1.1. Specifically, what you noted has been fixed on both the SVN
>>> trunk and the 7.1.2 beta. Here's the note in the HISTORY file:
>>>
>>> - Fix a problem inadvertently caused by bug 682: instead of trying to
>>> rectify crmpi modules that are sent by MPI processes to the spawning
>>> agent, simply disallow MPI_COMM_SPAWN'ed processes from being
>>> checkpointable.
>>>
>>> So that code now actually reads:
>>>
>>> if (mpi_nparent == 0) {
>>>   if (lam_ssi_crmpi_base_available != NULL) {
>>>     module = (lam_ssi_module_t *) al_top(lam_ssi_crmpi_base_available);
>>>   }
>>> } else {
>>>   module = NULL;
>>> }
>>>
>>> I totally forgot that we had fixed this in the 7.1.2 beta; mea culpa
>>> for not identifying this earlier. :-(
>>>
>>> Can you try the 7.1.2 beta and see if that works for you?
>>>
>>> http://www.lam-mpi.org/beta/
>>>
>> End of lam-devel Digest, Vol 141, Issue 1
>> *****************************************
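
The selection logic quoted above (no checkpoint module for MPI_COMM_SPAWN'ed processes) can be illustrated with a small self-contained sketch. This is not LAM's source: cr_module, cr_module_list, list_top() and select_cr_module() are hypothetical stand-ins for lam_ssi_module_t, the lam_ssi_crmpi_base_available list, al_top() and the surrounding code, and the nparent argument plays the role of mpi_nparent.

```c
#include <stddef.h>

/* Hypothetical stand-in for a crmpi module descriptor
 * (not LAM's real lam_ssi_module_t). */
typedef struct {
    const char *name;
} cr_module;

/* Hypothetical stand-in for the list of available crmpi modules;
 * LAM keeps these in its own list type and reads it with al_top(). */
typedef struct {
    const cr_module *items;
    size_t count;
} cr_module_list;

/* Return the first entry of the list, or NULL if there is none. */
static const cr_module *list_top(const cr_module_list *list)
{
    if (list == NULL || list->count == 0) {
        return NULL;
    }
    return &list->items[0];
}

/* Pick a checkpoint/restart module for this process.
 * nparent == 0 means the process was not started via MPI_COMM_SPAWN;
 * spawned processes (nparent != 0) never get a module, so they are
 * simply not checkpointable. */
const cr_module *select_cr_module(int nparent,
                                  const cr_module_list *available)
{
    if (nparent != 0) {
        return NULL;
    }
    return list_top(available);
}
```

The point of this shape is that a spawned process is excluded up front, rather than trying to reconcile its module choice with the spawning agent afterwards.
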

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/