I have a couple of questions regarding the LAM "self" checkpoint
module. The first problem that concerns me is a crash that
occurs whenever a lamcheckpoint request is issued. The segfault
occurs in ssi_crlam_self.c, when create_app_schema calls
free(tmp_as) (3rd from the last statement within
create_app_schema). I've gone as far as removing the calls to
my own checkpointing library and replacing them with simple
printfs, to rule out any possibility that my library is
interfering. Could this be a bug in the self checkpoint module?
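
For illustration, printf-only stubs of the kind described above might
look like the sketch below. It assumes three callbacks (checkpoint,
continue, restart) that take no arguments and return an int status.
The names my_checkpoint, my_continue and my_restart are hypothetical
placeholders, not the names the self module actually looks up, and no
LAM or checkpointing-library calls are made.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the user-supplied callbacks.
 * The names and the "return 0 on success" convention are assumptions
 * made for this sketch; each stub only prints a message. */

int my_checkpoint(void)
{
    /* Assumed to run when a checkpoint is requested. */
    printf("checkpoint callback invoked\n");
    return 0;
}

int my_continue(void)
{
    /* Assumed to run when the job continues after a checkpoint. */
    printf("continue callback invoked\n");
    return 0;
}

int my_restart(void)
{
    /* Assumed to run when the job is restarted from a checkpoint. */
    printf("restart callback invoked\n");
    return 0;
}
```

With stubs like these, nothing from a user checkpointing library runs
during the request, so a crash at free(tmp_as) could not originate in
that library's code.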
Also, what functionality should my checkpointing library provide
with respect to restarting the checkpointed MPI job? Does my
checkpoint/restart library need to make a call to MPI_Init()
upon restart? Short of that, how else can I reinitialize the
communication channels?
Thanks,
JP