Hello,
I am working with LAM/MPI and BLCR to checkpoint applications. I replicate
the checkpointed files across selected LAM nodes and then try to restart
the application on them.
When I tested on just a single node, the application gets checkpointed
with BLCR and the corresponding context file is produced. But when I tested
with two nodes, mpirun itself hangs, so the application never even gets
checkpointed.
After debugging, I found that when an mpirun command is issued, MPI_Init()
gets blocked. Going deeper, I traced the problem to the following
snippet of code in /share/mpi/laminit.c:
/*
* If spawned or started by mpirun, receive the list of GPS. Local
* world GPS's are first followed by the parents (if any). Otherwise if
* the number of processes is one assume a singleton init, else assume one
* process per node and pids are not needed.
*/
printf("kli_init() - Receiving GPS list \n");   /* debugging, added by me */
if ((_kio.ki_parent > 0) || (_kio.ki_rtf & RTF_MPIRUN)) {
    nhead.nh_event = (-lam_getpid()) & 0xBFFFFFFF;
    nhead.nh_type = BLKMPIINIT;
    nhead.nh_flags = DINT4DATA;
    nhead.nh_length = procs_n * sizeof(struct _gps);
    nhead.nh_msg = (char *) procs;
    printf("Gonna do nrecv \n");                /* debugging, added by me */
    if (nrecv(&nhead)) {                        /* <-- nrecv blocks here */
        free((char *) procs);
        return(LAMERROR);
    }
    printf("Did nrecv \n");                     /* debugging, added by me */
The nrecv() call blocks on the remote node, so MPI_Init() never
proceeds any further.
Since this happens even before the checkpoint module is invoked, I am
wondering whether it could be related to a network issue or to a
mismatch in directory structures across the nodes.
--
Mukuntan V Viswanathan
Visit My webpage at
www.buffalo.edu/~mvv2