Hello,

I am working with LAM/MPI and BLCR to checkpoint applications. I replicate the checkpointed files across selected LAM nodes and then try to restart the application on them.

When I tested on a single node, the application was checkpointed with BLCR and the corresponding context file was produced. But when I tested with two nodes, mpirun itself hangs, so the application never even gets checkpointed.

After debugging, I found that MPI_Init() blocks when the mpirun command is issued. Digging deeper, I traced the problem to the following snippet of code in /share/mpi/laminit.c:

    /*
     * If spawned or started by mpirun, receive the list of GPS. Local
     * world GPS's are first followed by the parents (if any). Otherwise if
     * the number of processes is one assume a singleton init, else assume one
     * process per node and pids are not needed.
     */

    printf("kli_init() - Receiving GPS list \n");
    if ((_kio.ki_parent > 0) || (_kio.ki_rtf & RTF_MPIRUN)) {

        nhead.nh_event = (-lam_getpid()) & 0xBFFFFFFF;
        nhead.nh_type = BLKMPIINIT;
        nhead.nh_flags = DINT4DATA;
        nhead.nh_length = procs_n * sizeof(struct _gps);
        nhead.nh_msg = (char *) procs;

        printf("Gonna do nrecv \n");    // Debugging statement added by me

        if (nrecv(&nhead)) {            // nrecv blocks
            free((char *) procs);
            return(LAMERROR);
        }

        printf("Did nrecv \n");         // added by me

The nrecv() call blocks on the remote node, so MPI_Init() never proceeds any further.

Since this happens even before the checkpoint module is called, I am wondering whether it is related to a network issue or to a mismatch in directory structures across the nodes.

--

Mukuntan V Viswanathan

Visit My webpage at
www.buffalo.edu/~mvv2