Hi all,
I am working with a 10-node LAM-MPI based cluster with BLCR. I would like to know what algorithm or protocol is used to coordinate checkpointing. I read in the mail archives that it is a modified implementation of the Chandy-Lamport algorithm, but that was in the 2004 archives. Can somebody let me know how the coordination is currently done during checkpointing (on a call to cr_checkpoint)? If there is documentation of the algorithm used, it would be great if you could point me to the appropriate link. We are working on our bachelor's thesis in college and would like to understand the coordination process used to get a global snapshot of the MPI application.
Thanks in advance,
Nannan