Hi Jeff & Liu,
I downloaded lam-7.1.2b30 and installed it, but there are two problems:
1. If lamd exits when I invoke "cr_checkpoint --term
${pid_mpirun}", then "cr_restart context.${pid_mpirun}" cannot
restart the whole program.
2. Even if lamd does not exit, when I invoke "cr_checkpoint --term
${pid_mpirun}" multiple times, "cr_restart" always restarts the
program from the first (earliest) checkpoint, which means the subsequent
checkpoints have no effect. In fact, if I delete
context.${pid_mpirun} while the application is running, the
subsequent "cr_checkpoint --term ${pid_mpirun}" calls no longer generate
any checkpoint file. Why?
Would you help me?
Thanks!
Yuan
lam-devel-request_at_[hidden] wrote:
>Today's Topics:
>
> 1. Re: Problem regarding Checkpoint/restart issue in LAM
> (Jeff Squyres)
> 2. Re: Problem regarding Checkpoint/restart issue in LAM
> (Liu Xuezhao)
> 3. Re: Problem regarding Checkpoint/restart issue in LAM
> (Jeff Squyres)
>
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Thu, 5 Jan 2006 16:21:21 -0500
>From: Jeff Squyres <jsquyres_at_[hidden]>
>Subject: Re: [lam-devel] Problem regarding Checkpoint/restart issue in
> LAM
>To: LAM/MPI development issues <lam-devel_at_[hidden]>
>Message-ID: <473502DF-8FF5-4394-B627-5EF24CE304D0_at_[hidden]>
>Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>
>On Jan 4, 2006, at 1:39 AM, Liu Xuezhao wrote:
>
>
>
>> I ran into this problem too. I think it is a bug in lam-7.1.1, but
>>it has not been confirmed by the LAM developers.
>> You can use the earlier LAM 7.0.6, but I have found that
>>7.0.6 has another bug: it creates the file
>>"/tmp/lam-xxx_at_node01/lam-crtcp-rank-0.txt", which prevents the run
>>from being restarted after lamhalt and lamboot are executed.
>>
>>
>
>Sorry for the delay in replying to this -- the holidays and other
>fires prevented me from replying before now.
>
>Yes, as Brian mentioned, 7.0.6 inadvertently created the logfile in
>the LAM session directory, preventing you from restarting in a new
>LAM universe. The problem was fixed somewhere along the way -- I
>don't remember offhand if it was in 7.1 or 7.1.1, but I know that
>7.1.1 does not have that problem.
>
>
>
>> It seems that fault tolerance is not an essential
>>feature of LAM/MPI? Can the LAM developers tell us something about
>>the project plans for fault tolerance in LAM/MPI?
>>
>>
>
>LAM/MPI is pretty much in a maintenance mode -- we are spending the
>vast majority of our time on Open MPI these days (see the notice on
>the front page of the LAM web site). LAM is certainly not going away
>-- we will continue to provide bug fixes, etc. But little new work
>is happening in LAM.
>
>In Open MPI, we plan to continue our FT work as well as branch off in
>several new directions of FT. In short, Open MPI is shaping up to be
>a much better environment for FT experimentation and research than
>LAM was (not that there is anything wrong with LAM -- it's just that
>Open MPI was designed with all the experience gained from LAM and
>several other systems, and therefore we did it "better").
>Eventually, we'll support a variety of FT mechanisms in Open MPI.
>
>Specifically, now that Open MPI is in a fairly stable state, BLCR
>support is slated to be added to Open MPI this upcoming spring. Work
>for this support is underway, but there's nothing interesting to
>report to users yet -- the initial required infrastructure for FT is
>being added right now.
>
>
>
>>>BTW: The LAM/MPI version I use is 7.1.1
>>>
>>>And the problem seem to be when I invoke the
>>>
>>>
>>>
>>>>cr_checkpoint --term $PID_of_mpirun
>>>>
>>>>
>
>Are you sure that blcr support is correctly installed in your LAM
>installation? Check the output of laminfo:
>
>shell$ laminfo | grep blcr
> SSI cr: blcr (API v1.0, Module v1.1)
>
>If you see that "SSI" line, then blcr support is properly included in
>your LAM installation. The question then becomes why the images were
>not properly created when you cr_checkpointed mpirun.
>
>The best way to do this is to turn up the verbosity of the cr system
>and ensure that everything is happening properly. For example:
>
>shell$ mpirun -ssi cr_verbose level:1000,stderr -ssi rpi crtcp -ssi cr blcr -np 2 your_application
>
>You initially should see a bunch of output to stderr indicating that
>blcr was selected. When you invoke cr_checkpoint, you should see all
>the steps that LAM goes through to checkpoint.
>
>Does that happen for you?
>
>--
>{+} Jeff Squyres
>{+} The Open MPI Project
>{+} http://www.open-mpi.org/
>
>
>
>
>------------------------------
>
>Message: 2
>Date: Fri, 6 Jan 2006 16:11:32 +0800
>From: "Liu Xuezhao" <lxz_at_[hidden]>
>Subject: Re: [lam-devel] Problem regarding Checkpoint/restart issue in
> LAM
>To: LAM/MPI development issues <lam-devel_at_[hidden]>
>Message-ID: <20060106081222.9B54DFB046_at_[hidden]>
>Content-Type: text/plain; charset="GB2312"
>
>Hi,
> I think the problem Yuan Tang ran into is the same as mine. I have now resolved it.
> By tracing the source code of lam-7.1.1, I found that the cause is that mpirun does not receive the correct parameters from LAM's initialization. The crlam module name received by mpirun is "none" rather than the expected "blcr".
> In the file /share/mpi/lammpiinit.c, in the function "lam_send_selected_ssi_modules", at line 571, the code is:
>---------------
>/*
> * Also copy the selected CRMPI module's name to send to mpirun.
> *
> * It is possible that no CR modules were selected. So handle that case.
> */
>#if 0
>  if (lam_ssi_crmpi_base_available != NULL)
>    module = (lam_ssi_module_t *) al_top(lam_ssi_crmpi_base_available);
>#else
>  /* JMS, for the moment, due to bug 682, we're just going to skip
>     checking cr modules. */
>  module = NULL;
>#endif
>-------------------
> The same section of lam-7.0.6 is:
>-------------------
>/*
> * Also copy the selected CRMPI module's name to send to mpirun.
> *
> * It is possible that no CR modules were selected. So handle that case.
> */
>  if (lam_ssi_crmpi_base_available != NULL)
>    module = (lam_ssi_module_t *) al_top(lam_ssi_crmpi_base_available);
>-------------------
> I don't know why those two lines of code were disabled and replaced with "module = NULL;" in 7.1.1; perhaps the LAM developers had other reasons for that change.
> I changed it back by turning the "#if 0" into "#if 1", then recompiled and reinstalled.
> Now mpirun receives the correct parameters, and applications can be checkpointed with cr_checkpoint and restarted with cr_restart correctly.
> Thanks.
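
To make the effect of the "#if 0" switch described above concrete, here is a minimal standalone sketch. It does not use LAM's real types or functions: module_t, pick_module, module_name, and the SKIP_CR_MODULES macro are all hypothetical stand-ins. The idea that a NULL module ends up being reported to mpirun as "none" is an assumption inferred from the behaviour described above, not something taken from the LAM source.

```c
#include <stddef.h>

/* Hypothetical stand-in for lam_ssi_module_t; not LAM's real type. */
typedef struct {
    const char *name;
} module_t;

/* Hypothetical switch: 1 mimics the 7.1.1 branch that always sets
   module = NULL, 0 mimics the 7.0.6 lookup. */
#ifndef SKIP_CR_MODULES
#define SKIP_CR_MODULES 1
#endif

/* Returns the module whose name would be sent to mpirun. */
static const module_t *pick_module(const module_t *available)
{
#if SKIP_CR_MODULES
    (void) available;      /* lookup disabled, as in the 7.1.1 code */
    return NULL;
#else
    return available;      /* may still be NULL if no CR module was selected */
#endif
}

/* Assumption: a NULL module is reported under the name "none". */
static const char *module_name(const module_t *m)
{
    return (m != NULL) ? m->name : "none";
}
```

With the default setting, passing a module named "blcr" through pick_module and module_name yields "none", which matches what mpirun was seen to receive. Compiling with -DSKIP_CR_MODULES=0 lets the module through unchanged.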
>
>Xuezhao
>2006-01-06
>
>------------------------------
>
>Message: 3
>Date: Fri, 6 Jan 2006 11:45:55 -0500
>From: Jeff Squyres <jsquyres_at_[hidden]>
>Subject: Re: [lam-devel] Problem regarding Checkpoint/restart issue in
> LAM
>To: LAM/MPI development issues <lam-devel_at_[hidden]>
>Message-ID: <F78E5102-3405-421E-A035-14ECF87FE358_at_[hidden]>
>Content-Type: text/plain; charset=UTF-8; delsp=yes; format=flowed
>
>Yoinks; sorry, this is my fault for not noticing earlier. :-(
>
>I did my tests yesterday with the LAM subversion trunk -- *not*
>7.1.1. Specifically, what you noted has been fixed on both the SVN
>trunk and the 7.1.2 beta. Here's the note in the HISTORY file:
>
>- Fix a problem inadvertently caused by bug 682: instead of trying to
>rectify crmpi modules that are sent by MPI processes to the spawning
>agent, simply disallow MPI_COMM_SPAWN'ed processes from being
>checkpointable.
>
>So that code now actually reads:
>
>    if (mpi_nparent == 0) {
>      if (lam_ssi_crmpi_base_available != NULL) {
>        module = (lam_ssi_module_t *) al_top(lam_ssi_crmpi_base_available);
>      }
>    } else {
>      module = NULL;
>    }
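
The branch structure quoted above can be tried in isolation with a small sketch. The names cr_module_t and select_crmpi are hypothetical and exist only for illustration. The reading of mpi_nparent as "number of parent processes, zero unless the process was started through MPI_COMM_SPAWN" is an assumption based on the HISTORY note, not on the LAM source.

```c
#include <stddef.h>

/* Hypothetical stand-in for lam_ssi_module_t; not LAM's real type. */
typedef struct {
    const char *name;
} cr_module_t;

/*
 * Mirrors the quoted logic: only processes without a parent may report
 * a CR module; spawned processes always report none.
 * Assumption: mpi_nparent is 0 for processes not started by MPI_COMM_SPAWN.
 */
static const cr_module_t *select_crmpi(int mpi_nparent,
                                       const cr_module_t *available)
{
    if (mpi_nparent == 0) {
        return available;  /* NULL here means no CR module was selected */
    }
    return NULL;           /* spawned processes are not checkpointable */
}
```

Under these assumptions, select_crmpi(0, &blcr) returns the module itself, while any non-zero mpi_nparent returns NULL regardless of what is available.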
>
>I totally forgot that we had fixed this in the 7.1.2 beta; mea culpa
>for not identifying this earlier. :-(
>
>Can you try the 7.1.2 beta and see if that works for you?
>
> http://www.lam-mpi.org/beta/
>
>
>--
>{+} Jeff Squyres
>{+} The Open MPI Project
>{+} http://www.open-mpi.org/
>
>
>
>
>
>End of lam-devel Digest, Vol 140, Issue 1
>*****************************************
>
>