
LAM/MPI Development Mailing List Archives


From: Liu Xuezhao (lxz_at_[hidden])
Date: 2006-01-06 03:11:32


Hi,
        I think the problem Yuan Tang ran into is the same one I had. I have now resolved it.
    By tracing the source code of lam-7.1.1, I found that the cause is that mpirun does not receive the correct parameters from LAM's initialization. The crlam module name received by mpirun is "none" rather than the expected "blcr".
    In the file /share/mpi/lammpiinit.c, in the function "lam_send_selected_ssi_modules", at line 571, the code is:
---------------
/*
   * Also copy the selected CRMPI module's name to send to mpirun.
   *
   * It is possible that no CR modules were selected. So handle that case.
   */
#if 0
  if (lam_ssi_crmpi_base_available != NULL)
    module = (lam_ssi_module_t *) al_top(lam_ssi_crmpi_base_available);
#else
  /* JMS, for the moment, due to bug 682, we're just going to skip
     checking cr modules. */
  module = NULL;
#endif
-------------------
    The same section of lam-7.0.6 is:
-------------------
/*
   * Also copy the selected CRMPI module's name to send to mpirun.
   *
   * It is possible that no CR modules were selected. So handle that case.
   */
  if (lam_ssi_crmpi_base_available != NULL)
    module = (lam_ssi_module_t *) al_top(lam_ssi_crmpi_base_available);
-------------------
    I don't know why these 2 lines of code were disabled and replaced with "module = NULL;" in 7.1.1. According to the comment, it was done because of bug 682, so the LAM developers may have had other reasons for the change.
    I changed it back by turning the "#if 0" into "#if 1" and nothing else, then recompiled and reinstalled.
    Now mpirun receives the correct parameters, and the applications can be checkpointed with cr_checkpoint and restarted with cr_restart correctly.
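
    To make the effect of that switch easier to see, here is a minimal, self-contained sketch in C. It is not LAM code: sketch_module_t, sketch_selected_cr_name and the check_cr_modules flag are hypothetical names made up for illustration, and the "none" fallback is only assumed from what mpirun was observed to receive. The flag stands in for the "#if 0" / "#if 1" choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a selected CR module; NOT the real
 * lam_ssi_module_t from the LAM sources. */
typedef struct {
    const char *name;
} sketch_module_t;

/* Returns the module name that would be passed on to mpirun.
 *
 * check_cr_modules == 0 mimics the "#if 0" branch in 7.1.1, where the
 * module pointer is forced to NULL regardless of what was selected.
 * check_cr_modules != 0 mimics the 7.0.6 behaviour (or "#if 1"), where
 * the available module is used when there is one.
 *
 * The "none" fallback is an assumption based on the observed value. */
const char *sketch_selected_cr_name(const sketch_module_t *available,
                                    int check_cr_modules)
{
    const sketch_module_t *module = NULL;

    if (check_cr_modules && available != NULL)
        module = available;

    return (module != NULL) ? module->name : "none";
}

#ifdef SKETCH_STANDALONE
/* Build with -DSKETCH_STANDALONE to run this small demonstration. */
int main(void)
{
    sketch_module_t blcr = { "blcr" };

    /* Check disabled: the selected module is ignored. */
    assert(strcmp(sketch_selected_cr_name(&blcr, 0), "none") == 0);
    /* Check enabled: the selected module's name is reported. */
    assert(strcmp(sketch_selected_cr_name(&blcr, 1), "blcr") == 0);

    printf("disabled: %s\n", sketch_selected_cr_name(&blcr, 0));
    printf("enabled:  %s\n", sketch_selected_cr_name(&blcr, 1));
    return 0;
}
#endif
```

    With the check disabled, the name falls back to "none" even though a blcr module is available, which matches what I saw in mpirun before the change.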
    Thanks.

Xuezhao
2006-01-06
======= 2006-01-05 16:21:00 Jeff Squyres wrote: =======
>
>> I met this problem also, I think this is a bug of lam-7.1.1 but
>> it has not been confirmed by LAM developers.

>Are you sure that blcr support is correctly installed in your LAM
>installation? Check the output of laminfo:
>
>shell$ laminfo | grep blcr
> SSI cr: blcr (API v1.0, Module v1.1)
>
>If you see that "SSI" line, then blcr support is properly included in
>your LAM installation. The question then becomes why the images were
>not properly created when you cr_checkpointed mpirun.
>
>The best way to do this is to turn up the verbosity of the cr system
>and ensure that everything is happening properly. For example:
>
>shell$ mpirun -ssi cr_verbose level:1000,stderr -ssi rpi crtcp -ssi
>cr blcr -np 2 your_application
>
>You initially should see a bunch of output to stderr indicating that
>blcr was selected. When you invoke cr_checkpoint, you should see all
>the steps that LAM goes through to checkpoint.
>
>Does that happen for you?
>
>--
>{+} Jeff Squyres
>{+} The Open MPI Project
>{+} http://www.open-mpi.org/
>
