
LAM/MPI Development Mailing List Archives


From: Mars Lenjoy (mars_lenjoy_at_[hidden])
Date: 2006-05-01 02:16:10


That's still a pain.

Application-level checkpointing is easier for migration. If the compiler could insert the checkpoint code automatically, that would be really cool!

:)

Brian Barrett <brbarret_at_[hidden]> wrote:
On Apr 29, 2006, at 9:46 PM, Mars Lenjoy wrote:

> LAM/MPI checkpoints using BLCR, which saves the whole process
> image to the context file. If the process is very large, the
> context file will be just as large, and writing it takes a long
> time.
>
> According to my tests, the context file grows to about the size of
> the process's memory. For example,
> int arr[700][700][700];
> takes more than 1 GB of memory; if the physical memory is less than
> that, cr_restart fails.
>
> MPI is very popular in high-performance computing, and many HPC
> problems need to load huge data arrays into memory. That is a
> disaster for checkpointing, and that's the point! So at this
> time, the CR seems useless...
>
> My question is: are there any ideas for improving this? Or are any
> new features for this issue planned in the next version's design?
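As a quick check of the size quoted above, the arithmetic can be written out in C. This is only a sketch: the helper name array_bytes is made up for illustration, and it assumes a 4-byte int, which is common but not guaranteed by the C standard.

```c
#include <assert.h>

/* Hypothetical helper: bytes needed by a dense nx * ny * nz array
 * whose elements are elem_size bytes each. */
unsigned long long array_bytes(unsigned long long nx,
                               unsigned long long ny,
                               unsigned long long nz,
                               unsigned long long elem_size)
{
    assert(elem_size > 0);
    return nx * ny * nz * elem_size;
}
```

With 700 x 700 x 700 elements of 4 bytes each this gives 1,372,000,000 bytes, so a process-level checkpoint of that process has to write at least that much to the context file.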

Yes, this is a problem with most checkpoint/restart systems. There
are a number of solutions, with varying levels of usefulness. There
are checkpointing systems that only save memory that changed from the
previous checkpoint, there are message logging systems, and I believe
there have been some compression systems. The problem for HPC is
that none of these work really well for systems where you are trying
to checkpoint/restart large numbers of processes with memory that
changes often.
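
To make the "only save memory that changed" idea concrete, here is a rough sketch in C. It is not how BLCR or LAM/MPI work; the function name incremental_save, the shadow-copy comparison, and the record layout are all assumptions for illustration. Real systems typically track dirty pages with help from the operating system instead of comparing against a full copy.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: compare each block of `mem` against `shadow`,
 * the copy taken at the previous checkpoint. Changed blocks are
 * written to `out` as (offset, length, data) records and copied into
 * `shadow`. If `out` is NULL, nothing is written and the blocks are
 * only counted. Returns the number of changed blocks handled. */
size_t incremental_save(const unsigned char *mem, unsigned char *shadow,
                        size_t size, size_t block, FILE *out)
{
    assert(mem != NULL && shadow != NULL && block > 0);
    size_t saved = 0;
    for (size_t off = 0; off < size; off += block) {
        size_t len = (size - off < block) ? size - off : block;
        if (memcmp(mem + off, shadow + off, len) == 0)
            continue;                 /* unchanged since last checkpoint */
        if (out != NULL) {
            if (fwrite(&off, sizeof off, 1, out) != 1 ||
                fwrite(&len, sizeof len, 1, out) != 1 ||
                fwrite(mem + off, 1, len, out) != len)
                break;                /* stop at the first I/O error */
        }
        memcpy(shadow + off, mem + off, len);
        saved++;
    }
    return saved;
}
```

The sketch also shows the limitation described above: if most blocks change between checkpoints, nearly everything is written again and little is gained.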

For some classes of HPC problems, there are points in the application
life where memory required for a restart is at a minimum. In these
cases, application-level checkpoint/restart has been extremely
effective because it limits the amount of information that needs to
be stored. There have been attempts to determine these points at
compile time, although I'm unaware of how successful they have been.
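
As an illustration of the application-level approach, here is a minimal sketch in C. The struct app_state and the functions app_checkpoint and app_restart are hypothetical names, not part of LAM/MPI or BLCR, and the sketch assumes the application can rebuild everything else from this small state.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical minimal restart state: the application is assumed to
 * be able to recompute or reload everything else. */
typedef struct {
    int    iteration;   /* next iteration to run */
    double residual;    /* small piece of solver state */
} app_state;

/* Write the state to `path`. Returns 0 on success, -1 on failure. */
int app_checkpoint(const char *path, const app_state *s)
{
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t n = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || n != 1)
        return -1;
    return 0;
}

/* Read the state back from `path`. Returns 0 on success, -1 on failure. */
int app_restart(const char *path, app_state *s)
{
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t n = fread(s, sizeof *s, 1, f);
    fclose(f);
    return (n == 1) ? 0 : -1;
}
```

The checkpoint file here is sizeof(app_state) bytes, however large the process image is. Calling app_checkpoint at a point where the large working arrays are not needed for a restart is what keeps the saved state small.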

To answer your question, it's unlikely that we will be implementing
any further improvements to the LAM/MPI checkpoint/restart
functionality (other than required bug fixes), as LAM is currently in
maintenance mode. We are currently doing all new development work in
the Open MPI project. We are just starting to work on process-level
fault tolerance -- our initial work will be replicating the work from
LAM/MPI and MPICH-V, although we intend to look into some of the
problems with checkpointing HPC applications once we have a stable
fault tolerance framework to build on.

Hope this helps,

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/