Yoinks; sorry, this is my fault for not noticing earlier. :-(
I did my tests yesterday with the LAM subversion trunk -- *not*
7.1.1. Specifically, what you noted has been fixed on both the SVN
trunk and the 7.1.2 beta. Here's the note in the HISTORY file:
- Fix a problem inadvertently caused by bug 682: instead of trying to
rectify crmpi modules that are sent by MPI processes to the spawning
agent, simply disallow MPI_COMM_SPAWN'ed processes from being
checkpointable.
So that code now actually reads:
  if (mpi_nparent == 0) {
    if (lam_ssi_crmpi_base_available != NULL) {
      module = (lam_ssi_module_t *) al_top(lam_ssi_crmpi_base_available);
    }
  } else {
    module = NULL;
  }
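
To make the effect of that change concrete, here is a minimal standalone sketch of the selection rule. It is not the LAM source: module_t, select_crmpi_module() and crmpi_name_for_mpirun() are hypothetical stand-ins for lam_ssi_module_t and the surrounding code, and the "none" fallback is an assumption based only on the value Liu reports mpirun receiving.

```c
#include <stddef.h>

/* Hypothetical stand-in for lam_ssi_module_t; the real type lives in LAM. */
typedef struct {
    const char *name;
} module_t;

/*
 * Sketch of the 7.1.2 selection rule.
 *   nparent   - stands in for mpi_nparent (nonzero for spawned processes)
 *   available - stands in for the top of lam_ssi_crmpi_base_available,
 *               or NULL when no CR module was selected
 */
static const module_t *select_crmpi_module(int nparent,
                                           const module_t *available)
{
    if (nparent != 0) {
        return NULL;      /* MPI_COMM_SPAWN'ed processes: never checkpointable */
    }
    return available;     /* may itself be NULL if no CR module was selected */
}

/* Assumed behavior: report "none" when no module is selected. */
static const char *crmpi_name_for_mpirun(int nparent,
                                         const module_t *available)
{
    const module_t *m = select_crmpi_module(nparent, available);
    return (m != NULL) ? m->name : "none";
}
```

Under these assumptions, a non-spawned process with blcr available reports "blcr", while a spawned process reports "none" no matter which modules are available.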
I totally forgot that we had fixed this in the 7.1.2 beta; mea culpa
for not identifying this earlier. :-(
Can you try the 7.1.2 beta and see if that works for you?
http://www.lam-mpi.org/beta/
On Jan 6, 2006, at 3:11 AM, Liu Xuezhao wrote:
> Hi,
> I think the problem Yuan Tang met is the same as mine. I have now
> resolved it.
> By tracing the source code of lam-7.1.1, I found that the reason is
> that mpirun doesn't receive the correct parameters from LAM's
> initialization. The crlam module name received by mpirun is "none"
> rather than the expected "blcr".
> In the file /share/mpi/lammpiinit.c, in the function
> "lam_send_selected_ssi_modules", at line 571, the code is:
> ---------------
> /*
> * Also copy the selected CRMPI module's name to send to mpirun.
> *
> * It is possible that no CR modules were selected. So handle
> * that case.
> */
> #if 0
>   if (lam_ssi_crmpi_base_available != NULL)
>     module = (lam_ssi_module_t *) al_top(lam_ssi_crmpi_base_available);
> #else
>   /* JMS, for the moment, due to bug 682, we're just going to skip
>      checking cr modules. */
>   module = NULL;
> #endif
> -------------------
> The same section of lam-7.0.6 is:
> -------------------
> /*
> * Also copy the selected CRMPI module's name to send to mpirun.
> *
> * It is possible that no CR modules were selected. So handle
> * that case.
> */
>   if (lam_ssi_crmpi_base_available != NULL)
>     module = (lam_ssi_module_t *) al_top(lam_ssi_crmpi_base_available);
> -------------------
> I don't know why these 2 lines of code were disabled and replaced
> with "module = NULL;" in 7.1.1; perhaps there were other reasons
> that led the LAM developers to make that change.
> I changed it back by only turning the "#if 0" into "#if 1", then
> recompiled and reinstalled.
> Now mpirun receives the correct parameters, and the applications
> can be checkpointed with cr_checkpoint and restarted with
> cr_restart correctly.
> Thanks.
>
> Xuezhao
> 2006-01-06
> ======= 2006-01-05 16:21:00 Jeff Squyres wrote: =======
>>
>>> I met this problem also, I think this is a bug of lam-7.1.1 but
>>> it has not been confirmed by LAM developers.
>
>> Are you sure that blcr support is correctly installed in your LAM
>> installation? Check the output of laminfo:
>>
>> shell$ laminfo | grep blcr
>> SSI cr: blcr (API v1.0, Module v1.1)
>>
>> If you see that "SSI" line, then blcr support is properly included in
>> your LAM installation. The question then becomes why the images were
>> not properly created when you cr_checkpointed mpirun.
>>
>> The best way to do this is to turn up the verbosity of the cr system
>> and ensure that everything is happening properly. For example:
>>
>> shell$ mpirun -ssi cr_verbose level:1000,stderr -ssi rpi crtcp -ssi
>> cr blcr -np 2 your_application
>>
>> You initially should see a bunch of output to stderr indicating that
>> blcr was selected. When you invoke cr_checkpoint, you should see all
>> the steps that LAM goes through to checkpoint.
>>
>> Does that happen for you?
>>
>> --
>> {+} Jeff Squyres
>> {+} The Open MPI Project
>> {+} http://www.open-mpi.org/
>>
>
> _______________________________________________
> lam-devel mailing list
> lam-devel_at_[hidden]
> http://www.lam-mpi.org/mailman/listinfo.cgi/lam-devel
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/