Josh,
I stumbled upon what's actually causing the crash on my machines over
here. As far as I know, this only happens with the 'self' checkpoint
module. Typically, I run my code with:
mpirun -v -ssi ssi_verbose 100 -ssi rpi crtcp -ssi cr self -ssi
cr_self_prefix zandy -ssi cr_verbose 100 -np 2 ./tester
If I issue a lamcheckpoint with mpirun's PID, mpirun crashes with a
segfault in the create_app_schema function. However, if I remove either
the `-ssi cr_verbose 100` or the `-ssi ssi_verbose 100` argument, things
seem to work fine. I now see that enabling `-ssi ssi_verbose 100` on top of
cr_verbose doesn't gain anything, but are they supposed to be mutually
exclusive?
JP
On Sun, 2006-04-09 at 14:33 -0400, Josh Hursey wrote:
> JP,
>
> I couldn't replicate the crash with the self module in 7.2b1svn10317.
> I just submitted a documentation and bug-fix patch (with the help of
> Jeff Squyres) to the LAM/MPI trunk for the cr 'self' module. It is
> mostly a documentation commit, with a small bug fix. If you need the
> self module I would suggest using this build.
> Could you try again with either the latest nightly build (if you get
> it tomorrow when this change goes in [7.2b1svn10319]) or from the
> anonymous svn access (if you need it today)? You can get both from here:
> http://www.lam-mpi.org/svn/
>
> Recently on this list, there have been some questions about the
> 'self' cr module in LAM/MPI. So the latest commit includes a lot of
> documentation about how to properly use this module. To bring people
> on the list up to speed on this module, let me explain a bit about
> what it does.
>
> The self module invokes user-defined functions to save and restore
> checkpoints. It allows user-defined functions to be invoked at LAM/
> MPI's Checkpoint, Continue, and Restart phases. Therefore, the only
> data that is saved during the checkpoint is what is written in the
> user's checkpoint function. No MPI library state is saved at all.
>
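A minimal sketch of what "only what the user's checkpoint function writes is saved" can look like, assuming the application state fits in a small struct. The struct fields and the save_state/load_state helpers are hypothetical and not part of LAM/MPI; a checkpoint callback would call save_state and a restart callback would call load_state.

```c
#include <stdio.h>

/* Hypothetical application state; the fields are illustrative only. */
struct app_state {
    int    current_step;
    double residual;
};

/* Write the state to `path` as text. Returns 0 on success, -1 on failure. */
int save_state(const char *path, const struct app_state *st) {
    FILE *fp = fopen(path, "w");
    if (NULL == fp) {
        return -1;
    }
    /* %.17g is enough digits for a double to survive the text round trip. */
    if (fprintf(fp, "%d %.17g\n", st->current_step, st->residual) < 0) {
        fclose(fp);
        return -1;
    }
    return (0 == fclose(fp)) ? 0 : -1;
}

/* Read the state back from `path`. Returns 0 on success, -1 on failure. */
int load_state(const char *path, struct app_state *st) {
    FILE *fp = fopen(path, "r");
    int nread;

    if (NULL == fp) {
        return -1;
    }
    nread = fscanf(fp, "%d %lg", &st->current_step, &st->residual);
    fclose(fp);
    /* Both fields must have been parsed for the restore to count. */
    return (2 == nread) ? 0 : -1;
}
```

Checking the fscanf result matters here: a truncated checkpoint file would otherwise restore a partially filled struct without any error.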
> - The Checkpoint Phase callback occurs just after the MPI library has
> been suspended, therefore using MPI commands inside the user-defined
> callback is not allowed.
>
> - The Continue Phase callback occurs just after the Checkpoint Phase
> to allow the application to receive notification that the application
> is continuing execution past a checkpoint. This tends to not be as
> useful when using the 'self' module, but is provided to be symmetric
> with the other modules.
>
> - The Restart Phase callback is invoked from within MPI_INIT upon
> restarting of a new instance of the application.
>
> In order to use the 'self' module it is critical that the application
> be compiled with the appropriate linker flags to export the symbols
> for the user-defined checkpoint/restart/continue functions. For GCC-
> based compilers it is the (-export) flag.
>
> To run an application you need to specify some SSI parameters:
> - cr_self_prefix
> This is the user-defined prefix for the checkpoint/restart/
> continue functions.
> for example:
> -ssi cr_self_prefix my_personal
> Will look for the functions
> int my_personal_checkpoint(void);
> int my_personal_continue(void);
> int my_personal_restart(void);
> By default (if you don't specify this SSI parameter), LAM/MPI
> looks for the functions with the 'lam_cr_self' prefix, of the form:
> int lam_cr_self_checkpoint(void);
> int lam_cr_self_continue(void);
> int lam_cr_self_restart(void);
> - cr_self_do_restart 1
> If this is set to '1' then the Restart Phase callback is invoked
> when the application is in MPI_INIT.
>
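To make the naming convention above concrete, here is a small sketch that builds the three callback names from a prefix. The helper cr_callback_name is hypothetical; how LAM/MPI actually resolves the symbols internally is not shown in this thread, so this only illustrates which names the application has to define and export.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build "<prefix>_<phase>" into `out`, where phase is one of
 * "checkpoint", "continue" or "restart".
 * Returns 0 on success, -1 if the prefix is empty or the name does not fit.
 */
int cr_callback_name(const char *prefix, const char *phase,
                     char *out, size_t out_len) {
    int n;

    if (0 == strlen(prefix)) {
        return -1;
    }
    n = snprintf(out, out_len, "%s_%s", prefix, phase);
    /* snprintf reports the length it wanted; anything >= out_len was cut. */
    if (n < 0 || (size_t)n >= out_len) {
        return -1;
    }
    return 0;
}
```

With the prefix "my_personal" this yields my_personal_checkpoint, my_personal_continue and my_personal_restart; with "lam_cr_self" it yields the default names listed above.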
> For example, if my application were named 'personal-cr', I would run
> it with:
> shell$ mpirun -np 2 -ssi cr_self_prefix my_personal -ssi rpi crtcp
> -ssi cr self personal-cr
>
> To checkpoint the application, you need only know the PID of the
> MPIRUN process. For example, if the PID of MPIRUN is XXXX, then to
> checkpoint the application you need only:
> shell$ lamcheckpoint -ssi cr self -pid XXXX
>
> To restart the application, you need to use MPIRUN with the
> 'cr_self_do_restart' SSI parameter, as follows:
> shell$ mpirun -np 2 -ssi cr_self_prefix my_personal -ssi rpi crtcp
> -ssi cr self -ssi cr_self_do_restart 1 personal-cr
>
> Notice that the original execution and the restart commands both use
> MPIRUN, and take the same arguments with the addition of the
> 'cr_self_do_restart' SSI parameter in the restart case. The
> lamrestart command should not be used to restart the application
> at the moment, as it doesn't work as one might expect with the 'self'
> module. It does work fine with the other cr modules.
>
> There is an example piece of source code attached to the bottom of
> this message.
>
> I hope that helps get people started. Sorry for the delay on
> replying. Please send any questions you have to the list and I'll try
> to help as much as I can.
>
> Cheers,
> Josh
>
>
> <code example>
> /*
>  * Example LAM/MPI CR self program
>  * Josh Hursey
>  * April 9, 2006
>  *
>  * To compile:
>  *   $ mpicc -g -export -o personal-cr personal-cr.c
>  *
>  * To Run:
>  *   $ mpirun -np 2 \
>  *       -ssi cr_self_prefix my_personal \
>  *       -ssi rpi crtcp \
>  *       -ssi cr self \
>  *       personal-cr
>  *
>  * To Checkpoint:
>  *   $ lamcheckpoint -ssi cr self -pid XXXX
>  *
>  * To Restart: (Don't use lamrestart)
>  *   $ mpirun -np 2 \
>  *       -ssi cr_self_prefix my_personal \
>  *       -ssi rpi crtcp \
>  *       -ssi cr self \
>  *       -ssi cr_self_do_restart 1 \
>  *       personal-cr
>  */
> #include <stdio.h>
> #include <unistd.h> /* for sleep() */
> #include <mpi.h>
>
> #define LIMIT 20
>
> /************************
> * Function Declarations
> ************************/
> /* Default LAM/MPI cr self callback functions */
> int lam_cr_self_checkpoint(void);
> int lam_cr_self_continue(void);
> int lam_cr_self_restart(void);
>
> /* LAM/MPI cr self callback functions */
> int my_personal_checkpoint(void);
> int my_personal_continue(void);
> int my_personal_restart(void);
>
> /*******************
> * Global Variables
> *******************/
> int current_step = 0;
> char ckpt_file[128] = "my-personal-cr-file.ckpt";
>
> /*********
> * Main
> *********/
> int main(int argc, char *argv[]) {
>     int rank, size;
>
>     current_step = 0;
>
>     /*
>      * If we are restarting, then our callback is called
>      * from inside MPI_Init.
>      * It is safe to initialize global variables before calling
>      * MPI_Init without fear of overwriting the values that
>      * the restart callback might set them to.
>      */
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     for(; current_step < LIMIT; current_step += 1) {
>         printf("%d of %d: Step %d\n", rank, size, current_step);
>         sleep(1);
>     }
>
>     MPI_Finalize();
>     return 0;
> }
>
> /* LAM/MPI default cr self callbacks for checkpoint */
> int lam_cr_self_checkpoint(void) {
>     printf("lam_cr_self_checkpoint callback...\n");
>
>     return 0;
> }
>
> int lam_cr_self_continue(void) {
>     printf("lam_cr_self_continue callback...\n");
>
>     return 0;
> }
>
> int lam_cr_self_restart(void) {
>     printf("lam_cr_self_restart callback...\n");
>
>     return 0;
> }
>
> /* LAM/MPI cr self callback for checkpoint */
> int my_personal_checkpoint(void) {
>     FILE *fp;
>
>     printf("my_personal_checkpoint callback...\n");
>
>     /*
>      * Open our checkpoint file
>      */
>     if( NULL == (fp = fopen(ckpt_file, "w")) ) {
>         fprintf(stderr, "Error: Unable to open file (%s)\n",
>                 ckpt_file);
>         return -1;
>     }
>
>     /*
>      * Save the process state
>      */
>     fprintf(fp, "%d\n", current_step);
>
>     /*
>      * Close the checkpoint file
>      */
>     fclose(fp);
>
>     return 0;
> }
>
> int my_personal_continue(void) {
>     printf("my_personal_continue callback...\n");
>     /* Don't need to do anything here since we are in the
>      * state that we want to be in already.
>      */
>     return 0;
> }
>
> int my_personal_restart(void) {
>     FILE *fp;
>
>     printf("my_personal_restart callback...\n");
>
>     /*
>      * Open our checkpoint file
>      */
>     if( NULL == (fp = fopen(ckpt_file, "r")) ) {
>         fprintf(stderr, "Error: Unable to open file (%s)\n",
>                 ckpt_file);
>         return -1;
>     }
>
>     /*
>      * Access the process state that we saved and
>      * update the current step variable.
>      */
>     fscanf(fp, "%d", &current_step);
>
>     fclose(fp);
>
>     printf("my_personal_restart: Restarting from step %d\n",
>            current_step);
>
>     return 0;
> }
> </code example>
>
>
>
> On Apr 5, 2006, at 11:03 PM, John Paul Walters wrote:
>
> > Josh,
> >
> > Thanks for the response. I've been using LAM version
> > 7.2b1svn10281.
> >
> > Regards,
> > JP
> >
> > On Wed, 2006-04-05 at 22:22 -0400, Josh Hursey wrote:
> >> JP,
> >>
> >> I have been taking a look at the 'self' module due to the questions
> >> that have been asked on the list lately. Unfortunately I believe I
> >> have discovered a bug or two with that module. I'm taking a look at
> >> it now, and will reply back with more details and information.
> >>
> >> Which version of LAM/MPI are you using in which you encountered the
> >> problem that you highlighted?
> >>
> >> Sorry I can't be much more help at the moment, but I'll post back
> >> soon.
> >>
> >> Cheers,
> >> Josh
> >>
> >> On Apr 4, 2006, at 6:31 PM, John Paul Walters wrote:
> >>
> >>>
> >>> I have a couple of questions regarding the LAM "self"
> >>> checkpoint
> >>> module. The first problem that concerns me is a crash that
> >>> occurs whenever a lamcheckpoint request is issued. The
> >>> segfault
> >>> occurs in ssi_crlam_self.c, when create_app_schema calls
> >>> free(tmp_as) (3rd from the last statement within
> >>> create_app_schema). I've gone as far as to remove the
> >>> calls to
> >>> the checkpointing library that I've provided to avoid any
> >>> possibility that my library is interfering, and replacing
> >>> the
> >>> calls with simple printfs. Could this be a bug in the self
> >>> checkpoint module?
> >>>
> >>> Also, what functionality should my checkpointing library
> >>> provide
> >>> with respect to restarting the checkpointed MPI job?
> >>> Does my
> >>> checkpoint/restart library need to make a call to MPI_Init()
> >>> upon restart? Short of that, how else can I reinitialize
> >>> the
> >>> communication channels?
> >>>
> >>> Thanks,
> >>> JP
> >>>
> >>>
> >>> _______________________________________________
> >>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >>
> >> ----
> >> Josh Hursey
> >> jjhursey_at_[hidden]
> >> http://www.lam-mpi.org/
> >>
> >>
> >
>
> ----
> Josh Hursey
> jjhursey_at_[hidden]
> http://www.lam-mpi.org/
>
>