Hello,
I am working with LAM/MPI and BLCR to checkpoint applications. I replicate the checkpointed files across selected LAM nodes and then try to restart the application on them.
When I test on a single node, the application is checkpointed with BLCR and the corresponding context file is produced. But when I test with two nodes, mpirun itself hangs, so the application never even gets checkpointed.
After debugging, I found that when an mpirun command is issued, MPI_Init() blocks. Going deeper, I traced the problem to the following snippet of code in /share/mpi/laminit.c:
/*
 * If spawned or started by mpirun, receive the list of GPS.  Local
 * world GPS's are first followed by the parents (if any).  Otherwise if
 * the number of processes is one assume a singleton init, else assume one
 * process per node and pids are not needed.
 */
    printf("kli_init() - Receiving GPS list \n");
    if ((_kio.ki_parent > 0) || (_kio.ki_rtf & RTF_MPIRUN)) {
        nhead.nh_event = (-lam_getpid()) & 0xBFFFFFFF;
        nhead.nh_type = BLKMPIINIT;
        nhead.nh_flags = DINT4DATA;
        nhead.nh_length = procs_n * sizeof(struct _gps);
        nhead.nh_msg = (char *) procs;
        printf("Gonna do nrecv \n"); // Debugging statement added by me
        if (nrecv(&nhead)) { // nrecv blocks
            free((char *) procs);
            return(LAMERROR);
        }
        printf("Did nrecv \n"); // added by me
The nrecv call blocks on the remote node, and consequently MPI_Init() never proceeds any further.
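To make the receive in the snippet easier to reason about, here is a minimal standalone sketch of how the event value is derived from the process ID. The helper name init_event_for_pid is hypothetical, and it assumes lam_getpid() returns a positive 32-bit pid. Only the expression (-pid) & 0xBFFFFFFF comes from the snippet. If the sender addresses the GPS list to an event computed from a different pid than the one the receiving process sees, a blocking receive like the one above would presumably never be matched.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the expression in the snippet:
 *     nhead.nh_event = (-lam_getpid()) & 0xBFFFFFFF;
 * The negation is done in unsigned arithmetic so that it is well defined
 * for every input. The mask clears bit 30 of the negated pid.
 */
static uint32_t init_event_for_pid(int32_t pid)
{
    assert(pid > 0); /* assumption: pids are positive */
    return (0u - (uint32_t)pid) & 0xBFFFFFFFu;
}
```

For example, init_event_for_pid(1234) gives 0xBFFFFB2E. Printing this value next to the "Gonna do nrecv" message on each node would show which event every process is actually waiting on.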
Since this happens even before the checkpoint module is called, I am wondering whether it is related to a network issue or to a mismatch in directory structures across the nodes.
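On the network side, one thing that could be ruled out is a node whose hostname resolves to a loopback address, since other nodes could then not reach it by that name. That this is the cause here is only an assumption. The sketch below uses the standard POSIX calls inet_pton and getaddrinfo; the function names is_loopback_ipv4 and host_resolves_to_loopback are hypothetical helpers, not part of LAM.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

/* Returns 1 if the dotted-quad string lies in 127.0.0.0/8, 0 if it does
 * not, and -1 if the string is not a valid IPv4 address. */
static int is_loopback_ipv4(const char *dotted)
{
    struct in_addr addr;

    assert(dotted != NULL);
    if (inet_pton(AF_INET, dotted, &addr) != 1)
        return -1;
    return (ntohl(addr.s_addr) >> 24) == 127;
}

/* Returns 1 if any IPv4 address that `host` resolves to is a loopback
 * address, 0 if none is, and -1 if the name cannot be resolved. */
static int host_resolves_to_loopback(const char *host)
{
    struct addrinfo hints, *res, *p;
    int found = 0;

    assert(host != NULL);
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;       /* IPv4 only, to keep the sketch short */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        return -1;

    for (p = res; p != NULL; p = p->ai_next) {
        const struct sockaddr_in *sin = (const struct sockaddr_in *)p->ai_addr;
        if ((ntohl(sin->sin_addr.s_addr) >> 24) == 127)
            found = 1;
    }
    freeaddrinfo(res);
    return found;
}
```

Calling host_resolves_to_loopback() on each node with that node's own hostname, as listed in the LAM boot schema, would show whether name resolution differs between the two machines.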
--
Mukuntan V Viswanathan
Visit My webpage at
www.buffalo.edu/~mvv2