Hi!
Since I am new to this group, I'd like to say hello to all the members :-)
I am currently developing a solution for checkpointing/restarting
(possibly migrating) MPI-enabled applications. I've decided to use a
the user-level checkpointing library ckpt to freeze the process state.
My idea is that, on receiving a specific signal, the application
(meaning all the distributed processes) chooses a safe point, one with
no risk of losing in-flight messages. Each process then calls
MPI_Finalize and checkpoints itself separately. After each node is
restarted, the processes know that they have been restarted (this
involves several modifications of the execution environment) and call
MPI_Init to re-create the MPI communicator and ranks. The problem is
that, although the processes are in fact new (different PIDs), LAM/MPI
still recognizes that MPI_Finalize has been called and refuses to
create a new MPI world.
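To make the signal-triggered safe point concrete, here is a minimal sketch of
the local part only. It assumes SIGUSR1 as the trigger, and the names
ckpt_requested, install_ckpt_handler and ckpt_safe_point are made up for
illustration; they are not part of ckpt or LAM/MPI. Agreeing on a common safe
point across all processes is not shown.

```c
#include <signal.h>

/* Set by the signal handler, polled by the application at safe points. */
static volatile sig_atomic_t ckpt_requested = 0;

/* Handler: only record the request; the real work happens later. */
static void on_ckpt_signal(int signo)
{
    (void)signo;
    ckpt_requested = 1;
}

/* Install the handler for the chosen signal (SIGUSR1 is an assumption).
   Returns 0 on success, -1 on failure. */
int install_ckpt_handler(int signo)
{
    return (signal(signo, on_ckpt_signal) == SIG_ERR) ? -1 : 0;
}

/* Call this where no messages are in flight.
   Returns 1 once for each pending request, otherwise 0. */
int ckpt_safe_point(void)
{
    if (!ckpt_requested)
        return 0;
    ckpt_requested = 0;
    return 1;
}
```

The application would call install_ckpt_handler(SIGUSR1) once at startup and
then ckpt_safe_point() between communication phases; a return value of 1 is
the cue to call MPI_Finalize and take the checkpoint.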
My question is: is there any way of changing this behaviour? Is it
possible to tell the MPI routines and structures to set themselves up
from scratch, so that MPI_Init can be called again?
I should probably add that such a solution is needed to achieve some
level of transparency of the checkpointing mechanism to the
programmer.
--
Greetings,
Marcin Frączak mailto:marcin.f_at_[hidden]