JP,
I couldn't replicate the crash with the self module in 7.2b1svn10317.
I just committed a patch (with the help of Jeff Squyres) to the
LAM/MPI trunk for the cr 'self' module. It is mostly a documentation
commit, plus a small bug fix. If you need the self module, I would
suggest using this build.
Could you try again with either the latest nightly build (if you get
it tomorrow, when this change goes in [7.2b1svn10319]) or the
anonymous svn access (if you need it today)? You can get both from here:
http://www.lam-mpi.org/svn/
Recently on this list, there have been some questions about the
'self' cr module in LAM/MPI. So the latest commit includes a lot of
documentation about how to properly use this module. To bring people
up to speed on this module on the list let me explain a bit about
what it does.
The self module invokes user-defined functions to save and restore
checkpoints. These functions are invoked at LAM/MPI's Checkpoint,
Continue, and Restart phases. Therefore, the only data that is saved
during the checkpoint is what the user's checkpoint function writes.
No MPI library state is saved at all.
- The Checkpoint Phase callback occurs just after the MPI library has
been suspended, so MPI calls are not allowed inside the user-defined
callback.
- The Continue Phase callback occurs just after the Checkpoint Phase
to notify the application that it is continuing execution past a
checkpoint. This tends to be less useful with the 'self' module, but
is provided for symmetry with the other modules.
- The Restart Phase callback is invoked from within MPI_INIT when a
new instance of the application is restarted.
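Since no MPI library state is saved and MPI calls are off limits in
the Checkpoint Phase, the callbacks usually come down to plain file
I/O on whatever the application needs in order to resume. A minimal
sketch of that idea follows. The 'app_state' struct and the helper
names 'save_state' and 'load_state' are my own illustration and are
not part of LAM/MPI:

```c
#include <stdio.h>

/* Hypothetical application state; the field names are illustrative. */
struct app_state {
    int    current_step;
    double partial_sum;
};

/*
 * Write the state to 'path' using only stdio (no MPI calls).
 * The raw struct dump is only meant to be read back by the same
 * binary on the same architecture.
 * Returns 0 on success, -1 on failure.
 */
int save_state(const char *path, const struct app_state *st) {
    FILE *fp = fopen(path, "wb");

    if (NULL == fp) {
        return -1;
    }
    if (1 != fwrite(st, sizeof(*st), 1, fp)) {
        fclose(fp);
        return -1;
    }
    return (0 == fclose(fp)) ? 0 : -1;
}

/*
 * Read the state back from 'path'.
 * Returns 0 on success, -1 if the file is missing or truncated.
 */
int load_state(const char *path, struct app_state *st) {
    FILE  *fp = fopen(path, "rb");
    size_t got;

    if (NULL == fp) {
        return -1;
    }
    got = fread(st, sizeof(*st), 1, fp);
    fclose(fp);
    return (1 == got) ? 0 : -1;
}
```

A checkpoint callback could call save_state() and a restart callback
could call load_state(), each passing the helper's return value back.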
In order to use the 'self' module it is critical that the application
be compiled with the appropriate linker flags to export the symbols
for the user-defined checkpoint/restart/continue functions. For
GCC-based compilers it is the (-export) flag.
To run an application you need to specify some SSI parameters:
- cr_self_prefix
  This is the user-defined prefix for the checkpoint/restart/continue
  functions. For example:
      -ssi cr_self_prefix my_personal
  will look for the functions:
      int my_personal_checkpoint(void);
      int my_personal_continue(void);
      int my_personal_restart(void);
  By default (if you don't specify this SSI parameter), LAM/MPI
  looks for the functions with the 'lam_cr_self' prefix, of the form:
      int lam_cr_self_checkpoint(void);
      int lam_cr_self_continue(void);
      int lam_cr_self_restart(void);
- cr_self_do_restart 1
  If this is set to '1', then the Restart Phase callback is invoked
  when the application is in MPI_INIT.
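To make the naming rule concrete, here is a small sketch of how a
callback name could be derived from the prefix. The helper
'build_callback_name' is hypothetical and only illustrates the
'<prefix>_<phase>' convention described above. It is not how LAM/MPI
itself is implemented:

```c
#include <stdio.h>
#include <string.h>

/* Default prefix named in this message. */
#define DEFAULT_PREFIX "lam_cr_self"

/*
 * Hypothetical helper (not part of LAM/MPI): build "<prefix>_<phase>"
 * in 'buf'. A NULL or empty prefix falls back to DEFAULT_PREFIX.
 * Returns 0 on success, -1 if the name does not fit in 'buf'.
 */
int build_callback_name(const char *prefix, const char *phase,
                        char *buf, size_t len) {
    int n;

    if (NULL == prefix || 0 == strlen(prefix)) {
        prefix = DEFAULT_PREFIX;
    }
    n = snprintf(buf, len, "%s_%s", prefix, phase);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Whatever name results has to exist as an exported symbol in the
application, which is why the linker flag mentioned earlier matters.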
For example, if my application were named 'personal-cr', I would run
it with:
shell$ mpirun -np 2 -ssi cr_self_prefix my_personal -ssi rpi crtcp \
    -ssi cr self personal-cr
To checkpoint the application, you need only know the PID of the
MPIRUN process. For example, if the PID of MPIRUN is XXXX, then to
checkpoint the application you need only:
shell$ lamcheckpoint -ssi cr self -pid XXXX
To restart the application, you need to use MPIRUN with the
'cr_self_do_restart' SSI parameter, as follows:
shell$ mpirun -np 2 -ssi cr_self_prefix my_personal -ssi rpi crtcp \
    -ssi cr self -ssi cr_self_do_restart 1 personal-cr
Notice that the original execution and the restart commands both use
MPIRUN, and take the same arguments, with the addition of the
'cr_self_do_restart' SSI parameter in the restart case. The
lamrestart command should not be used to restart the application
at the moment, as it doesn't work as one might expect with the 'self'
module. It does work fine with the other cr modules.
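The relationship between the two command lines can be sketched in a
few lines of C. The function 'format_mpirun' is a hypothetical helper
of my own, not a LAM/MPI API. It only uses the mpirun arguments shown
in this message and adds the restart parameter on request:

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a LAM/MPI API): format the mpirun command
 * line used in this message. A nonzero 'do_restart' adds the
 * cr_self_do_restart SSI parameter; everything else stays the same.
 * Returns 0 on success, -1 on bad input or if 'buf' is too small.
 */
int format_mpirun(char *buf, size_t len, int np, const char *prefix,
                  const char *app, int do_restart) {
    int n;

    if (NULL == prefix || NULL == app || 0 == strlen(app)) {
        return -1;
    }
    n = snprintf(buf, len,
                 "mpirun -np %d -ssi cr_self_prefix %s -ssi rpi crtcp "
                 "-ssi cr self%s %s",
                 np, prefix,
                 do_restart ? " -ssi cr_self_do_restart 1" : "",
                 app);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Keeping a single format string for both cases makes it hard for the
run and restart invocations to drift apart.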
There is an example piece of source code at the bottom of this
message.
I hope that helps get people started. Sorry for the delay in
replying. Please send any questions you have to the list and I'll try
to help as much as I can.
Cheers,
Josh
<code example>
/*
 * Example LAM/MPI CR self program
 * Josh Hursey
 * April 9, 2006
 *
 * To compile:
 * $ mpicc -g -export -o personal-cr personal-cr.c
 *
 * To Run:
 * $ mpirun -np 2 \
 *     -ssi cr_self_prefix my_personal \
 *     -ssi rpi crtcp \
 *     -ssi cr self \
 *     personal-cr
 *
 * To Checkpoint:
 * $ lamcheckpoint -ssi cr self -pid XXXX
 *
 * To Restart: (Don't use lamrestart)
 * $ mpirun -np 2 \
 *     -ssi cr_self_prefix my_personal \
 *     -ssi rpi crtcp \
 *     -ssi cr self \
 *     -ssi cr_self_do_restart 1 \
 *     personal-cr
 */
#include <stdio.h>
#include <unistd.h> /* sleep() */
#include <mpi.h>

#define LIMIT 20
/************************
 * Function Declarations
 ************************/
/* Default LAM/MPI cr self callback functions */
int lam_cr_self_checkpoint(void);
int lam_cr_self_continue(void);
int lam_cr_self_restart(void);

/* LAM/MPI cr self callback functions */
int my_personal_checkpoint(void);
int my_personal_continue(void);
int my_personal_restart(void);

/*******************
 * Global Variables
 *******************/
int current_step = 0;
char ckpt_file[128] = "my-personal-cr-file.ckpt";

/*********
 * Main
 *********/
int main(int argc, char *argv[]) {
    int rank, size;

    current_step = 0;

    /*
     * If we are restarting, then our restart callback is called
     * from inside MPI_Init.
     * It is safe to initialize global variables before calling
     * MPI_Init without fear of overwriting the values that the
     * restart callback sets, because the callback runs afterwards.
     */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for(; current_step < LIMIT; current_step += 1) {
        printf("%d of %d: Step %d\n", rank, size, current_step);
        sleep(1);
    }

    MPI_Finalize();
    return 0;
}

/* Default LAM/MPI cr self callbacks (used when no prefix is given) */
int lam_cr_self_checkpoint(void) {
    printf("lam_cr_self_checkpoint callback...\n");
    return 0;
}

int lam_cr_self_continue(void) {
    printf("lam_cr_self_continue callback...\n");
    return 0;
}

int lam_cr_self_restart(void) {
    printf("lam_cr_self_restart callback...\n");
    return 0;
}

/* LAM/MPI cr self callback for checkpoint */
int my_personal_checkpoint(void) {
    FILE *fp;

    printf("my_personal_checkpoint callback...\n");

    /*
     * Open our checkpoint file
     */
    if( NULL == (fp = fopen(ckpt_file, "w")) ) {
        fprintf(stderr, "Error: Unable to open file (%s)\n",
                ckpt_file);
        return -1; /* an int function must return a value */
    }

    /*
     * Save the process state
     */
    fprintf(fp, "%d\n", current_step);

    /*
     * Close the checkpoint file
     */
    fclose(fp);
    return 0;
}

/* LAM/MPI cr self callback for continue */
int my_personal_continue(void) {
    printf("my_personal_continue callback...\n");
    /* Don't need to do anything here since we are in the
     * state that we want to be in already.
     */
    return 0;
}

/* LAM/MPI cr self callback for restart */
int my_personal_restart(void) {
    FILE *fp;

    printf("my_personal_restart callback...\n");

    /*
     * Open our checkpoint file
     */
    if( NULL == (fp = fopen(ckpt_file, "r")) ) {
        fprintf(stderr, "Error: Unable to open file (%s)\n",
                ckpt_file);
        return -1; /* an int function must return a value */
    }

    /*
     * Access the process state that we saved and
     * update the current step variable.
     */
    if( 1 != fscanf(fp, "%d", &current_step) ) {
        fprintf(stderr, "Error: Unable to read step from file (%s)\n",
                ckpt_file);
        fclose(fp);
        return -1;
    }
    fclose(fp);

    printf("my_personal_restart: Restarting from step %d\n",
           current_step);
    return 0;
}
</code example>
On Apr 5, 2006, at 11:03 PM, John Paul Walters wrote:
> Josh,
>
> Thanks for the response. I've been using LAM version
> 7.2b1svn10281.
>
> Regards,
> JP
>
> On Wed, 2006-04-05 at 22:22 -0400, Josh Hursey wrote:
>> JP,
>>
>> I have been taking a look at the 'self' module due to the questions
>> that have been asked on the list lately. Unfortunately I believe I
>> have discovered a bug or two with that module. I'm taking a look at
>> it now, and will reply back with more details and information.
>>
>> Which version of LAM/MPI are you using in which you encountered the
>> problem that you highlighted?
>>
>> Sorry I can't be much more help at the moment, but I'll post back
>> soon.
>>
>> Cheers,
>> Josh
>>
>> On Apr 4, 2006, at 6:31 PM, John Paul Walters wrote:
>>
>>>
>>> I have a couple of questions regarding the LAM "self" checkpoint
>>> module. The first problem that concerns me is a crash that
>>> occurs whenever a lamcheckpoint request is issued. The segfault
>>> occurs in ssi_crlam_self.c, when create_app_schema calls
>>> free(tmp_as) (3rd from the last statement within
>>> create_app_schema). I've gone as far as to remove the calls to
>>> the checkpointing library that I've provided to avoid any
>>> possibility that my library is interfering, and replacing the
>>> calls with simple printfs. Could this be a bug in the self
>>> checkpoint module?
>>>
>>> Also, what functionality should my checkpointing library provide
>>> with respect to restarting the checkpointed MPI job? Does my
>>> checkpoint/restart library need to make a call to MPI_Init()
>>> upon restart? Short of that, how else can I reinitialize the
>>> communication channels?
>>>
>>> Thanks,
>>> JP
>>>
>>>
>>> _______________________________________________
>>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>
>> ----
>> Josh Hursey
>> jjhursey_at_[hidden]
>> http://www.lam-mpi.org/
>>
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
----
Josh Hursey
jjhursey_at_[hidden]
http://www.lam-mpi.org/