On Dec 1, 2005, at 9:33 PM, Liu Xuezhao wrote:
> But I have a question here. Does Open MPI support fault
> tolerance (i.e. checkpoint and restart for parallel applications)? I
> know that LAM-MPI together with BLCR provides checkpoint/restart
> functionality; will the CR module of LAM-MPI be included in Open
> MPI? I have not found any information about it.
Open MPI does not currently support checkpoint / restart
functionality. This is an area of active research by the Open MPI
development team. We intend to support both the coordinated
checkpoint/restart used in LAM/MPI and many of the approaches used in
FT-MPI and MPICH-V.
> We are now doing some work on fault tolerance and
> checkpoint/restart for MPI applications. It seems that LAM-MPI and
> BLCR can provide a basic solution for it. I wonder whether it is
> used in practice on real cluster systems; can you tell me something
> about it? Thanks again.
LAM/MPI and BLCR provide a good solution for smaller clusters. For
larger clusters, the size of the checkpoints can make it
impractical. Some of the things we are looking at for Open MPI
should help the situation a bit. As for the details of the
checkpoint / restart software in LAM/MPI, we have a paper on our web
page about the implementation:
http://www.lam-mpi.org/papers/
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/