JP,
I just relearned something interesting about LAM that I'm shocked
took me this long to discover. When specifying a verbose argument you
shouldn't set just a number (-ssi cr_verbose 100) but rather level:#
(e.g. -ssi cr_verbose level:100). The verbose arguments in LAM let
you do more than set the level of verbosity, such as redirecting
the verbose statements to a given output stream or file. For example:
-ssi cr_verbose level:1000,stdout
should redirect all of the verbose output for cr_verbose to stdout if
the level is marked at or below 1000.
Neat huh? :) Unfortunately it is not a well documented feature. :/
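To make the level:#,destination format concrete, here is a rough sketch in C of how such a value could be split apart. It is not LAM's actual parser: parse_verbose_spec and struct verbose_spec are names made up for illustration, and the only thing assumed about the format is that the parts are comma-separated, as in the example above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for a parsed verbose spec; not a LAM data structure. */
struct verbose_spec {
    int level;      /* value after "level:", or -1 if none was given */
    char dest[64];  /* e.g. "stdout"; empty string if none was given */
};

/*
 * Split a spec such as "level:1000,stdout" into its comma-separated parts.
 * Returns 0 on success, -1 if the spec is too long for the local buffer.
 */
static int parse_verbose_spec(const char *spec, struct verbose_spec *out) {
    char buf[128];
    char *tok;

    out->level = -1;
    out->dest[0] = '\0';

    if (strlen(spec) >= sizeof(buf)) {
        return -1;
    }
    strcpy(buf, spec);  /* strtok modifies its input, so work on a copy */

    for (tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        if (strncmp(tok, "level:", 6) == 0) {
            out->level = atoi(tok + 6);
        } else {
            /* Anything else is treated as the output destination */
            strncpy(out->dest, tok, sizeof(out->dest) - 1);
            out->dest[sizeof(out->dest) - 1] = '\0';
        }
    }
    return 0;
}
```

Calling parse_verbose_spec("level:1000,stdout", &vs) leaves vs.level at 1000 and vs.dest set to "stdout". A bare number such as "100" carries no "level:" part, so this sketch would treat it as a destination; that behaviour belongs to the sketch and says nothing about what LAM itself does with a bare number.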
In any case, I'm running with the argument set that you sent, and I
can't seem to replicate with one process. I was running with:
$ mpirun -v \
-ssi ssi_verbose 100 \
-ssi rpi crtcp \
-ssi cr self \
-ssi cr_self_prefix my_personal \
-ssi cr_verbose 100 \
-np 1 \
./personal-cr
However I did notice that if I run with more than one process I get a
segfault sometimes when the program is exiting normally. Not sure
what is causing that though.
I noticed that there was a fix for the app schema code fairly
recently. Do you see this problem with a newer install, like the
7.1.2 tarball or the nightly checkout?
Some other options:
- Have you tried emptying out your callbacks in case there is
something in there that is breaking things?
- You might also want to make sure you have a clean install just to
make sure it is not linking in some old libraries.
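For the first option, emptied-out callbacks could look roughly like the following. This is only a sketch: it assumes the my_personal prefix from the command line above and the callback signatures from my earlier message. The fprintf lines are just markers so you can see which callback fired; nothing here touches MPI or your checkpointing library.

```c
#include <stdio.h>

/*
 * Stub callbacks for use with "-ssi cr_self_prefix my_personal".
 * Each one only reports that it was called and returns success,
 * so a crash that still occurs cannot come from the callback bodies.
 */
int my_personal_checkpoint(void) {
    fprintf(stderr, "my_personal_checkpoint: stub\n");
    return 0;
}

int my_personal_continue(void) {
    fprintf(stderr, "my_personal_continue: stub\n");
    return 0;
}

int my_personal_restart(void) {
    fprintf(stderr, "my_personal_restart: stub\n");
    return 0;
}
```

If the segfault still shows up with stubs like these linked in (and the -export flag on the compile line), the callbacks themselves are probably not the culprit.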
Sorry I'm not much more help at the moment.
-- Josh
On Apr 10, 2006, at 7:18 PM, John Paul Walters wrote:
> Josh,
>
> I stumbled upon what's actually causing the crash on my machines over
> here. As far as I know, this only happens with the 'self' checkpoint
> module. Typically, I run my code with:
>
> mpirun -v -ssi ssi_verbose 100 -ssi rpi crtcp -ssi cr self -ssi
> cr_self_prefix zandy -ssi cr_verbose 100 -np 2 ./tester
>
> If I issue a lamcheckpoint with mpirun's PID, mpirun crashes with a
> segfault in the create_app_schema function. However, if I remove either
> the `-ssi cr_verbose 100` or the `-ssi verbose 100` arguments, things
> seem to work fine. I now see that enabling `-ssi verbose 100` on top of
> cr_verbose doesn't gain anything, but are they supposed to be mutually
> exclusive?
>
>
> JP
>
>
>
>
>
> On Sun, 2006-04-09 at 14:33 -0400, Josh Hursey wrote:
>> JP,
>>
>> I couldn't replicate the crash with the self module in 7.2b1svn10317.
>> I just submitted a documentation and bug fix patch (with the help of
>> Jeff Squyres) to the LAM/MPI trunk for the cr 'self' module. It is
>> mostly a documentation commit, and a small bug fix. If you need the
>> self module I would suggest using this build.
>> Could you try again with either the latest nightly build (if you get
>> it tomorrow when this change goes in [7.2b1svn10319]) or from the
>> anonymous svn access (if you need it today)? You can get both from
>> here:
>> http://www.lam-mpi.org/svn/
>>
>> Recently on this list, there have been some questions about the
>> 'self' cr module in LAM/MPI. So the latest commit includes a lot of
>> documentation about how to properly use this module. To bring people
>> up to speed on this module on the list let me explain a bit about
>> what it does.
>>
>> The self module invokes user-defined functions to save and restore
>> checkpoints. It allows user-defined functions to be invoked at
>> LAM/MPI's Checkpoint, Continue, and Restart phases. Therefore, the only
>> data that is saved during the checkpoint is what is written in the
>> user's checkpoint function. No MPI library state is saved at all.
>>
>> - The Checkpoint Phase callback occurs just after the MPI library has
>> been suspended, therefore using MPI commands inside the user-defined
>> callback is not allowed.
>>
>> - The Continue Phase callback occurs just after the Checkpoint Phase
>> to allow the application to receive notification that the application
>> is continuing execution past a checkpoint. This tends to not be as
>> useful when using the 'self' module, but is provided to be symmetric
>> with the other modules.
>>
>> - The Restart Phase callback is invoked from within MPI_INIT upon
>> restarting of a new instance of the application.
>>
>> In order to use the 'self' module it is critical that the application
>> be compiled with the appropriate linker flags to export the symbols
>> for the user-defined checkpoint/restart/continue functions. For
>> GCC-based compilers it is the (-export) flag.
>>
>> To run an application you need to specify some SSI parameters:
>> - cr_self_prefix
>> This is the user-defined prefix for the checkpoint/restart/
>> continue functions.
>> for example:
>> -ssi cr_self_prefix my_personal
>> Will look for the functions
>> int my_personal_checkpoint(void);
>> int my_personal_continue(void);
>> int my_personal_restart(void);
>> By default (if you don't specify this SSI parameter), LAM/MPI
>> looks for the functions with the 'lam_cr_self' prefix, of the form:
>> int lam_cr_self_checkpoint(void);
>> int lam_cr_self_continue(void);
>> int lam_cr_self_restart(void);
>> - cr_self_do_restart 1
>> If this is set to '1' then the Restart Phase callback is invoked
>> when the application is in MPI_INIT.
>>
>> For example, if my application were named 'personal-cr', to run it I
>> would use:
>> shell$ mpirun -np 2 -ssi cr_self_prefix my_personal -ssi rpi crtcp
>> -ssi cr self personal-cr
>>
>> To checkpoint the application, you need only know the PID of the
>> MPIRUN process. For example, if the PID of MPIRUN is XXXX, then to
>> checkpoint the application you need only:
>> shell$ lamcheckpoint -ssi cr self -pid XXXX
>>
>> To restart the application, you need to use MPIRUN with the
>> 'cr_self_do_restart' SSI parameter, as follows:
>> shell$ mpirun -np 2 -ssi cr_self_prefix my_personal -ssi rpi crtcp
>> -ssi cr self -ssi cr_self_do_restart 1 personal-cr
>>
>> Notice that the original execution and the restart commands both use
>> MPIRUN, and take the same arguments with the addition of the
>> 'cr_self_do_restart' SSI parameter in the restart case. The
>> lamrestart command should not be used to restart the application
>> at the moment, as it doesn't work as one might expect with the 'self'
>> module. It does work fine with the other cr modules.
>>
>> There is an example piece of source code attached to the bottom of
>> this message.
>>
>> I hope that helps get people started. Sorry for the delay on
>> replying. Please send any questions you have to the list and I'll try
>> to help as much as I can.
>>
>> Cheers,
>> Josh
>>
>>
>> <code example>
>> /*
>> * Example LAM/MPI CR self program
>> * Josh Hursey
>> * April 9, 2006
>> *
>> * To compile:
>> * $ mpicc -g -export -o personal-cr personal-cr.c
>> *
>> * To Run:
>> * $ mpirun -np 2 \
>> * -ssi cr_self_prefix my_personal \
>> * -ssi rpi crtcp \
>> * -ssi cr self \
>> * personal-cr
>> *
>> * To Checkpoint:
>> * $ lamcheckpoint -ssi cr self -pid XXXX
>> *
>> * To Restart: (Don't use lamrestart)
>> * $ mpirun -np 2 \
>> * -ssi cr_self_prefix my_personal \
>> * -ssi rpi crtcp \
>> * -ssi cr self \
>> * -ssi cr_self_do_restart 1 \
>> * personal-cr
>> */
>> #include <stdio.h>
>> #include <unistd.h> /* for sleep() */
>> #include <mpi.h>
>>
>> #define LIMIT 20
>>
>> /************************
>> * Function Declarations
>> ************************/
>> /* Default LAM/MPI cr self callback functions */
>> int lam_cr_self_checkpoint(void);
>> int lam_cr_self_continue(void);
>> int lam_cr_self_restart(void);
>>
>> /* LAM/MPI cr self callback functions */
>> int my_personal_checkpoint(void);
>> int my_personal_continue(void);
>> int my_personal_restart(void);
>>
>> /*******************
>> * Global Variables
>> *******************/
>> int current_step = 0;
>> char ckpt_file[128] = "my-personal-cr-file.ckpt";
>>
>> /*********
>> * Main
>> *********/
>> int main(int argc, char *argv[]) {
>> int rank, size;
>>
>> current_step = 0;
>>
>> /*
>> * If we are restarting, then our callback is called
>> * from inside MPI_Init.
>> * It is safe to initialize global variables before calling
>> * MPI_Init without fear of overwriting the values that
>> * the restart callback might change them to.
>> */
>> MPI_Init(&argc, &argv);
>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>> for(; current_step < LIMIT; current_step += 1) {
>> printf("%d of %d: Step %d\n", rank, size, current_step);
>> sleep(1);
>> }
>>
>> MPI_Finalize();
>> return 0;
>> }
>>
>> /* LAM/MPI default cr self callbacks for checkpoint */
>> int lam_cr_self_checkpoint(void) {
>> printf("lam_cr_self_checkpoint callback...\n");
>>
>> return 0;
>> }
>>
>> int lam_cr_self_continue(void) {
>> printf("lam_cr_self_continue callback...\n");
>>
>> return 0;
>> }
>>
>> int lam_cr_self_restart(void) {
>> printf("lam_cr_self_restart callback...\n");
>>
>> return 0;
>> }
>>
>> /* LAM/MPI cr self callback for checkpoint */
>> int my_personal_checkpoint() {
>> FILE *fp;
>>
>> printf("my_personal_checkpoint callback...\n");
>>
>> /*
>> * Open our checkpoint file
>> */
>> if( NULL == (fp = fopen(ckpt_file, "w")) ) {
>> fprintf(stderr, "Error: Unable to open file (%s)\n",
>> ckpt_file);
>> return -1; /* function returns int; signal the failure */
>> }
>>
>> /*
>> * Save the process state
>> */
>> fprintf(fp, "%d\n", current_step);
>>
>> /*
>> * Close the checkpoint file
>> */
>> fclose(fp);
>>
>> return 0;
>> }
>>
>> int my_personal_continue() {
>> printf("my_personal_continue callback...\n");
>> /* Don't need to do anything here since we are in the
>> * state that we want to be in already.
>> */
>> return 0;
>> }
>>
>> int my_personal_restart() {
>> FILE *fp;
>>
>> printf("my_personal_restart callback...\n");
>>
>> /*
>> * Open our checkpoint file
>> */
>> if( NULL == (fp = fopen(ckpt_file, "r")) ) {
>> fprintf(stderr, "Error: Unable to open file (%s)\n",
>> ckpt_file);
>> return -1; /* function returns int; signal the failure */
>> }
>>
>> /*
>> * Access the process state that we saved and
>> * update the current step variable.
>> */
>> fscanf(fp, "%d", &current_step);
>>
>> fclose(fp);
>>
>> printf("my_personal_restart: Restarting from step %d\n",
>> current_step);
>>
>> return 0;
>> }
>> </code example>
>>
>>
>>
>> On Apr 5, 2006, at 11:03 PM, John Paul Walters wrote:
>>
>>> Josh,
>>>
>>> Thanks for the response. I've been using LAM version
>>> 7.2b1svn10281.
>>>
>>> Regards,
>>> JP
>>>
>>> On Wed, 2006-04-05 at 22:22 -0400, Josh Hursey wrote:
>>>> JP,
>>>>
>>>> I have been taking a look at the 'self' module due to the questions
>>>> that have been asked on the list lately. Unfortunately I believe I
>>>> have discovered a bug or two with that module. I'm taking a look at
>>>> it now, and will reply back with more details and information.
>>>>
>>>> Which version of LAM/MPI are you using in which you encountered the
>>>> problem that you highlighted?
>>>>
>>>> Sorry I can't be much more help at the moment, but I'll post back
>>>> soon.
>>>>
>>>> Cheers,
>>>> Josh
>>>>
>>>> On Apr 4, 2006, at 6:31 PM, John Paul Walters wrote:
>>>>
>>>>>
>>>>> I have a couple of questions regarding the LAM "self" checkpoint
>>>>> module. The first problem that concerns me is a crash that occurs
>>>>> whenever a lamcheckpoint request is issued. The segfault occurs in
>>>>> ssi_crlam_self.c, when create_app_schema calls free(tmp_as) (3rd
>>>>> from the last statement within create_app_schema). I've gone as far
>>>>> as to remove the calls to the checkpointing library that I've
>>>>> provided to avoid any possibility that my library is interfering,
>>>>> and replacing the calls with simple printfs. Could this be a bug in
>>>>> the self checkpoint module?
>>>>>
>>>>> Also, what functionality should my checkpointing library provide
>>>>> with respect to restarting the checkpointed MPI job? Does my
>>>>> checkpoint/restart library need to make a call to MPI_Init() upon
>>>>> restart? Short of that, how else can I reinitialize the
>>>>> communication channels?
>>>>>
>>>>> Thanks,
>>>>> JP
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>>>
>>>> ----
>>>> Josh Hursey
>>>> jjhursey_at_[hidden]
>>>> http://www.lam-mpi.org/
>>>>
>>>>
>>>
>>
>> ----
>> Josh Hursey
>> jjhursey_at_[hidden]
>> http://www.lam-mpi.org/
>>
>>
>
----
Josh Hursey
jjhursey_at_[hidden]
http://www.lam-mpi.org/