LAM/MPI General User's Mailing List Archives

From: John Paul Walters (jwalters_at_[hidden])
Date: 2006-04-10 21:18:03


Josh,

The addition of "level" to the verbose arguments is news indeed. I was
previously testing on the latest nightly tarball. When I added
"level:#" rather than just "#" to the verbose arguments, things stopped
crashing. The lack of the "level" must have caused a memory problem
somewhere.

That said, does the app_schema really need to be created in the case of
the "self" module? It never gets written to disk (as far as I can tell)
and mpirun doesn't get checkpointed. So while it doesn't seem to be
hurting anything, it might be unnecessary.

thanks,
JP

On Mon, 2006-04-10 at 20:24 -0400, Josh Hursey wrote:
> JP,
>
> I just relearned something interesting about LAM, that I'm shocked
> took me this long to discover. When specifying a verbose argument you
> shouldn't set just a number (-ssi cr_verbose 100) but rather level:#
> (e.g. -ssi cr_verbose level:100). The verbose arguments in LAM let
> you do some interesting things outside of setting the level of
> verbosity, like redirecting those verbose statements to a given
> output stream or file. For example:
> -ssi cr_verbose level:1000,stdout
> Should redirect all of the verbose output for cr_verbose to stdout if
> the level is marked at or below 1000.
>
> Neat huh? :) Unfortunately it is not a well documented feature. :/
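To make the "level:#,stream" format concrete, here is a minimal C sketch of how such a spec could be split into a level and an output stream. It is not LAM/MPI's actual parser: the struct and function names (verbose_spec, parse_verbose_spec) are hypothetical, and the thread does not say how LAM itself treats a bare number, so the sketch simply leaves the level unset in that case.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for a parsed verbose spec; not a LAM/MPI type. */
struct verbose_spec {
    int level;        /* -1 if no "level:N" token was present */
    char stream[16];  /* e.g. "stdout"; empty if none was given */
};

/* Parse a comma-separated spec such as "level:1000,stdout".
 * Returns 0 on success, -1 if the spec is too long or the level
 * is not a non-negative integer. */
static int parse_verbose_spec(const char *spec, struct verbose_spec *out)
{
    char buf[128];
    char *tok;

    out->level = -1;
    out->stream[0] = '\0';

    if (strlen(spec) >= sizeof(buf))
        return -1;
    strcpy(buf, spec);  /* strtok modifies its input, so work on a copy */

    for (tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        if (strncmp(tok, "level:", 6) == 0) {
            char *end;
            long v = strtol(tok + 6, &end, 10);
            if (end == tok + 6 || *end != '\0' || v < 0)
                return -1;
            out->level = (int)v;
        } else {
            /* Any other token is treated as the output destination. */
            snprintf(out->stream, sizeof(out->stream), "%s", tok);
        }
    }
    return 0;
}
```

In this sketch, "level:1000,stdout" yields level 1000 and stream "stdout", while a bare "100" carries no "level:" token and so leaves the level at -1.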
>
> In any case, I'm running with the argument set that you sent, and I
> can't seem to replicate with one process. I was running with:
> $ mpirun -v \
> -ssi ssi_verbose 100 \
> -ssi rpi crtcp \
> -ssi cr self \
> -ssi cr_self_prefix my_personal \
> -ssi cr_verbose 100 \
> -np 1 \
> ./personal-cr
>
> However I did notice that if I run with more than one process I get a
> segfault sometimes when the program is exiting normally. Not sure
> what is causing that though.
>
> I noticed that there was a fix for the app schema code fairly
> recently. Do you see this problem with a newer install, like the
> 7.1.2 tarball or the nightly checkout?
>
> Some other options:
> - Have you tried emptying out your callbacks in case there is
> something in there that is breaking things?
> - You might also want to make sure you have a clean install just to
> make sure it is not linking in some old libraries.
>
> Sorry I'm not much more help at the moment.
>
> -- Josh
>
>
> On Apr 10, 2006, at 7:18 PM, John Paul Walters wrote:
>
> > Josh,
> >
> > I stumbled upon what's actually causing the crash on my machines over
> > here. As far as I know, this only happens with the 'self' checkpoint
> > module. Typically, I run my code with:
> >
> > mpirun -v \
> > -ssi ssi_verbose 100 \
> > -ssi rpi crtcp \
> > -ssi cr self \
> > -ssi cr_self_prefix zandy \
> > -ssi cr_verbose 100 \
> > -np 2 \
> > ./tester
> >
> > If I issue a lamcheckpoint with mpirun's PID, mpirun crashes with a
> > segfault in the create_app_schema function. However, if I remove either
> > the `-ssi cr_verbose 100` or the `-ssi ssi_verbose 100` arguments,
> > things seem to work fine. I now see that enabling `-ssi ssi_verbose 100`
> > on top of cr_verbose doesn't gain anything, but are they supposed to be
> > mutually exclusive?
> >
> >
> > JP
> >
> >
> >
> >
> >
> > On Sun, 2006-04-09 at 14:33 -0400, Josh Hursey wrote:
> >> JP,
> >>
> >> I couldn't replicate the crash with the self module in 7.2b1svn10317.
> >> I just submitted a patch (with the help of Jeff Squyres) to the
> >> LAM/MPI trunk for the cr 'self' module. It is mostly a documentation
> >> commit, plus a small bug fix. If you need the self module I would
> >> suggest using this build.
> >> Could you try again with either the latest nightly build (if you get
> >> it tomorrow, when this change goes in [7.2b1svn10319]) or from the
> >> anonymous svn access (if you need it today)? You can get both from
> >> here:
> >> http://www.lam-mpi.org/svn/
> >>
> >> Recently on this list, there have been some questions about the
> >> 'self' cr module in LAM/MPI. So the latest commit includes a lot of
> >> documentation about how to properly use this module. To bring people
> >> up to speed on this module on the list let me explain a bit about
> >> what it does.
> >>
> >> The self module invokes user-defined functions to save and restore
> >> checkpoints. These functions are invoked at LAM/MPI's Checkpoint,
> >> Continue, and Restart phases. Therefore, the only data that is saved
> >> during the checkpoint is what is written in the user's checkpoint
> >> function. No MPI library state is saved at all.
> >>
> >> - The Checkpoint Phase callback occurs just after the MPI library has
> >> been suspended, therefore using MPI commands inside the user-defined
> >> callback is not allowed.
> >>
> >> - The Continue Phase callback occurs just after the Checkpoint Phase
> >> to allow the application to receive notification that the application
> >> is continuing execution past a checkpoint. This tends to not be as
> >> useful when using the 'self' module, but is provided to be symmetric
> >> with the other modules.
> >>
> >> - The Restart Phase callback is invoked from within MPI_INIT upon
> >> restarting of a new instance of the application.
> >>
> >> In order to use the 'self' module it is critical that the application
> >> be compiled with the appropriate linker flags to export the symbols
> >> for the user-defined checkpoint/restart/continue functions. For
> >> GCC-based compilers it is the (-export) flag.
> >>
> >> To run an application you need to specify some SSI parameters:
> >> - cr_self_prefix
> >>   This is the user-defined prefix for the
> >>   checkpoint/restart/continue functions. For example:
> >>     -ssi cr_self_prefix my_personal
> >>   will look for the functions
> >>     int my_personal_checkpoint(void);
> >>     int my_personal_continue(void);
> >>     int my_personal_restart(void);
> >>   By default (if you don't specify this SSI parameter), LAM/MPI
> >>   looks for the functions with the 'lam_cr_self' prefix, of the form:
> >>     int lam_cr_self_checkpoint(void);
> >>     int lam_cr_self_continue(void);
> >>     int lam_cr_self_restart(void);
> >> - cr_self_do_restart 1
> >>   If this is set to '1' then the Restart Phase callback is invoked
> >>   when the application is in MPI_INIT.
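The naming rule described above (prefix plus phase, falling back to 'lam_cr_self') can be sketched in C as follows. This only illustrates how the callback names are formed; the helper cr_callback_name is hypothetical and is not part of LAM/MPI, and the thread does not describe how the library actually resolves the symbols.

```c
#include <stdio.h>

/* Build "<prefix>_<phase>" into buf, e.g. "my_personal_checkpoint".
 * A NULL or empty prefix falls back to the default "lam_cr_self".
 * Returns 0 on success, -1 if buf is too small for the full name. */
static int cr_callback_name(const char *prefix, const char *phase,
                            char *buf, size_t len)
{
    int n;

    if (prefix == NULL || prefix[0] == '\0')
        prefix = "lam_cr_self";

    n = snprintf(buf, len, "%s_%s", prefix, phase);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With the prefix "my_personal" and the phase "checkpoint" this produces "my_personal_checkpoint", the same name the example program defines.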
> >>
> >> For example, if my application was named 'personal-cr', to run it I
> >> would use:
> >> shell$ mpirun -np 2 -ssi cr_self_prefix my_personal -ssi rpi crtcp \
> >>     -ssi cr self personal-cr
> >>
> >> To checkpoint the application, you need only know the PID of the
> >> MPIRUN process. For example, if the PID of MPIRUN is XXXX, then to
> >> checkpoint the application you need only:
> >> shell$ lamcheckpoint -ssi cr self -pid XXXX
> >>
> >> To restart the application, you need to use MPIRUN with the
> >> 'cr_self_do_restart' SSI parameter, as follows:
> >> shell$ mpirun -np 2 -ssi cr_self_prefix my_personal -ssi rpi crtcp \
> >>     -ssi cr self -ssi cr_self_do_restart 1 personal-cr
> >>
> >> Notice that the original execution and the restart commands both use
> >> MPIRUN, and take the same arguments with the addition of the
> >> 'cr_self_do_restart' SSI parameter in the restart case. The
> >> lamrestart command should not be used to restart the application
> >> at the moment, as it doesn't work as one might expect with the 'self'
> >> module. It does work fine with the other cr modules.
> >>
> >> There is an example piece of source code attached to the bottom of
> >> this message.
> >>
> >> I hope that helps get people started. Sorry for the delay on
> >> replying. Please send any questions you have to the list and I'll try
> >> to help as much as I can.
> >>
> >> Cheers,
> >> Josh
> >>
> >>
> >> <code example>
> >> /*
> >> * Example LAM/MPI CR self program
> >> * Josh Hursey
> >> * April 9, 2006
> >> *
> >> * To compile:
> >> * $ mpicc -g -export -o personal-cr personal-cr.c
> >> *
> >> * To Run:
> >> * $ mpirun -np 2 \
> >> * -ssi cr_self_prefix my_personal \
> >> * -ssi rpi crtcp \
> >> * -ssi cr self \
> >> * personal-cr
> >> *
> >> * To Checkpoint:
> >> * $ lamcheckpoint -ssi cr self -pid XXXX
> >> *
> >> * To Restart: (Don't use lamrestart)
> >> * $ mpirun -np 2 \
> >> * -ssi cr_self_prefix my_personal \
> >> * -ssi rpi crtcp \
> >> * -ssi cr self \
> >> * -ssi cr_self_do_restart 1 \
> >> * personal-cr
> >> */
> >> #include <stdio.h>
> >> #include <unistd.h> /* sleep() */
> >> #include <mpi.h>
> >>
> >> #define LIMIT 20
> >>
> >> /************************
> >> * Function Declarations
> >> ************************/
> >> /* Default LAM/MPI cr self callback functions */
> >> int lam_cr_self_checkpoint(void);
> >> int lam_cr_self_continue(void);
> >> int lam_cr_self_restart(void);
> >>
> >> /* LAM/MPI cr self callback functions */
> >> int my_personal_checkpoint(void);
> >> int my_personal_continue(void);
> >> int my_personal_restart(void);
> >>
> >> /*******************
> >> * Global Variables
> >> *******************/
> >> int current_step = 0;
> >> char ckpt_file[128] = "my-personal-cr-file.ckpt";
> >>
> >> /*********
> >> * Main
> >> *********/
> >> int main(int argc, char *argv[]) {
> >>     int rank, size;
> >>
> >>     current_step = 0;
> >>
> >>     /*
> >>      * If we are restarting, then our callback is called
> >>      * from inside MPI_Init.
> >>      * It is safe to initialize global variables before calling
> >>      * MPI_Init without fear of overwriting the values that
> >>      * the restart callback sets.
> >>      */
> >>     MPI_Init(&argc, &argv);
> >>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>
> >>     for(; current_step < LIMIT; current_step += 1) {
> >>         printf("%d of %d: Step %d\n", rank, size, current_step);
> >>         sleep(1);
> >>     }
> >>
> >>     MPI_Finalize();
> >>     return 0;
> >> }
> >>
> >> /* LAM/MPI default cr self callbacks for checkpoint */
> >> int lam_cr_self_checkpoint(void) {
> >>     printf("lam_cr_self_checkpoint callback...\n");
> >>
> >>     return 0;
> >> }
> >>
> >> int lam_cr_self_continue(void) {
> >>     printf("lam_cr_self_continue callback...\n");
> >>
> >>     return 0;
> >> }
> >>
> >> int lam_cr_self_restart(void) {
> >>     printf("lam_cr_self_restart callback...\n");
> >>
> >>     return 0;
> >> }
> >>
> >> /* LAM/MPI cr self callback for checkpoint */
> >> int my_personal_checkpoint(void) {
> >>     FILE *fp;
> >>
> >>     printf("my_personal_checkpoint callback...\n");
> >>
> >>     /*
> >>      * Open our checkpoint file
> >>      */
> >>     if( NULL == (fp = fopen(ckpt_file, "w")) ) {
> >>         fprintf(stderr, "Error: Unable to open file (%s)\n",
> >>                 ckpt_file);
> >>         return -1; /* nonzero: could not open the file */
> >>     }
> >>
> >>     /*
> >>      * Save the process state
> >>      */
> >>     fprintf(fp, "%d\n", current_step);
> >>
> >>     /*
> >>      * Close the checkpoint file
> >>      */
> >>     fclose(fp);
> >>
> >>     return 0;
> >> }
> >>
> >> int my_personal_continue(void) {
> >>     printf("my_personal_continue callback...\n");
> >>     /* Don't need to do anything here since we are in the
> >>      * state that we want to be in already.
> >>      */
> >>     return 0;
> >> }
> >>
> >> int my_personal_restart(void) {
> >>     FILE *fp;
> >>
> >>     printf("my_personal_restart callback...\n");
> >>
> >>     /*
> >>      * Open our checkpoint file
> >>      */
> >>     if( NULL == (fp = fopen(ckpt_file, "r")) ) {
> >>         fprintf(stderr, "Error: Unable to open file (%s)\n",
> >>                 ckpt_file);
> >>         return -1; /* nonzero: could not open the file */
> >>     }
> >>
> >>     /*
> >>      * Access the process state that we saved and
> >>      * update the current step variable.
> >>      */
> >>     if( 1 != fscanf(fp, "%d", &current_step) ) {
> >>         fprintf(stderr, "Error: Unable to read step from file (%s)\n",
> >>                 ckpt_file);
> >>         fclose(fp);
> >>         return -1; /* nonzero: file did not contain a step */
> >>     }
> >>
> >>     fclose(fp);
> >>
> >>     printf("my_personal_restart: Restarting from step %d\n",
> >>            current_step);
> >>
> >>     return 0;
> >> }
> >> </code example>
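The save and restore steps in the example callbacks could also be factored into two small helpers that check every I/O call, so a callback can report failure instead of continuing with a partial or missing checkpoint. This is a minimal sketch in plain C with no MPI calls. The names save_step and load_step are hypothetical, and the thread does not say how LAM/MPI interprets a callback's return value, so the 0 / -1 convention here is an assumption.

```c
#include <stdio.h>

/* Write the step counter to path. Returns 0 on success, -1 on any
 * I/O error (open, write, or the flush performed by fclose). */
static int save_step(const char *path, int step)
{
    FILE *fp = fopen(path, "w");

    if (fp == NULL)
        return -1;
    if (fprintf(fp, "%d\n", step) < 0) {
        fclose(fp);
        return -1;
    }
    /* fclose flushes buffered data; a failure here means the
     * checkpoint may be incomplete. */
    return (fclose(fp) == 0) ? 0 : -1;
}

/* Read the step counter back from path. On success stores it in
 * *step and returns 0; on failure returns -1 and leaves *step
 * unchanged. */
static int load_step(const char *path, int *step)
{
    FILE *fp = fopen(path, "r");
    int value;
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%d", &value) == 1);
    fclose(fp);
    if (!ok)
        return -1;
    *step = value;
    return 0;
}
```

A checkpoint callback could then return save_step(ckpt_file, current_step), and a restart callback could return load_step(ckpt_file, &current_step).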
> >>
> >>
> >>
> >> On Apr 5, 2006, at 11:03 PM, John Paul Walters wrote:
> >>
> >>> Josh,
> >>>
> >>> Thanks for the response. I've been using LAM version
> >>> 7.2b1svn10281.
> >>>
> >>> Regards,
> >>> JP
> >>>
> >>> On Wed, 2006-04-05 at 22:22 -0400, Josh Hursey wrote:
> >>>> JP,
> >>>>
> >>>> I have been taking a look at the 'self' module due to the questions
> >>>> that have been asked on the list lately. Unfortunately I believe I
> >>>> have discovered a bug or two with that module. I'm taking a look at
> >>>> it now, and will reply back with more details and information.
> >>>>
> >>>> Which version of LAM/MPI are you using in which you encountered the
> >>>> problem that you highlighted?
> >>>>
> >>>> Sorry I can't be much more help at the moment, but I'll post back
> >>>> soon.
> >>>>
> >>>> Cheers,
> >>>> Josh
> >>>>
> >>>> On Apr 4, 2006, at 6:31 PM, John Paul Walters wrote:
> >>>>
> >>>>>
> >>>>> I have a couple of questions regarding the LAM "self" checkpoint
> >>>>> module. The first problem that concerns me is a crash that occurs
> >>>>> whenever a lamcheckpoint request is issued. The segfault occurs in
> >>>>> ssi_crlam_self.c, when create_app_schema calls free(tmp_as) (3rd
> >>>>> from the last statement within create_app_schema). I've gone as far
> >>>>> as to remove the calls to the checkpointing library that I've
> >>>>> provided, replacing them with simple printfs, to avoid any
> >>>>> possibility that my library is interfering. Could this be a bug in
> >>>>> the self checkpoint module?
> >>>>>
> >>>>> Also, what functionality should my checkpointing library provide
> >>>>> with respect to restarting the checkpointed MPI job? Does my
> >>>>> checkpoint/restart library need to make a call to MPI_Init() upon
> >>>>> restart? Short of that, how else can I reinitialize the
> >>>>> communication channels?
> >>>>>
> >>>>> Thanks,
> >>>>> JP
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >>>>
> >>>> ----
> >>>> Josh Hursey
> >>>> jjhursey_at_[hidden]
> >>>> http://www.lam-mpi.org/
> >>>>