As high-performance clusters continue to grow in size and popularity,
issues of fault tolerance and reliability are becoming limiting factors on
application scalability.
To address these issues, we present the design and implementation of a
system for providing coordinated checkpointing and rollback recovery for
MPI-based parallel applications.
Our approach integrates the Berkeley Lab BLCR kernel-level process
checkpoint system with the LAM implementation of MPI through a defined
checkpoint/restart interface.
Checkpointing is transparent to the application, allowing the system to be
used for cluster maintenance and scheduling reasons as well as for fault
tolerance.
Experimental results show negligible communication performance impact due
to the incorporation of the checkpoint support capabilities into LAM/MPI.