LAM/MPI General User's Mailing List Archives

From: Mukuntan (mukunth_at_[hidden])
Date: 2007-11-21 20:11:10


Hello,

I am working with LAM/MPI and BLCR to checkpoint applications. I replicate
the checkpointed files across selected LAM nodes and then try to restart
the application on them.

When I test on a single node, the application is checkpointed with BLCR and
the corresponding context file is produced. But when I test with two nodes,
mpirun itself hangs, so the application never even gets checkpointed.

After debugging, I found that when an mpirun command is issued, MPI_Init()
blocks. Digging deeper, I traced the problem to the following snippet of
code in /share/mpi/laminit.c:

/*
 * If spawned or started by mpirun, receive the list of GPS. Local
 * world GPS's are first followed by the parents (if any). Otherwise if
 * the number of processes is one assume a singleton init, else assume one
 * process per node and pids are not needed.
 */

printf("kli_init() - Receiving GPS list \n");
if ((_kio.ki_parent > 0) || (_kio.ki_rtf & RTF_MPIRUN)) {

    nhead.nh_event = (-lam_getpid()) & 0xBFFFFFFF;
    nhead.nh_type = BLKMPIINIT;
    nhead.nh_flags = DINT4DATA;
    nhead.nh_length = procs_n * sizeof(struct _gps);
    nhead.nh_msg = (char *) procs;

    printf("Gonna do nrecv \n"); // Debugging statement added by me

    if (nrecv(&nhead)) { // nrecv blocks
        free((char *) procs);
        return(LAMERROR);
    }

    printf("Did nrecv \n"); // added by me

The nrecv() call blocks on the remote node, so MPI_Init() never proceeds
any further.

Since this happens even before the checkpoint module is called, I am
wondering whether it is related to a network issue or to a mismatch in
directory structures across the nodes.
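
For anyone comparing event values between the two nodes: the nh_event that
nrecv() waits on in the snippet above is the negated pid with one bit masked
off by 0xBFFFFFFF (that mask clears bit 30 and keeps every other bit). Below
is a minimal standalone sketch of that arithmetic only. The function name
event_for_pid is hypothetical and not part of LAM, and I am assuming a 32-bit
two's-complement value; what LAM uses bit 30 for is not something I have
verified.

```c
#include <assert.h>   /* only for the sanity checks mentioned below */
#include <stdint.h>

/*
 * Hypothetical helper (not a LAM function): mirrors the expression
 *     (-lam_getpid()) & 0xBFFFFFFF
 * from the snippet above, using fixed-width unsigned arithmetic so the
 * result is well defined on any platform.
 */
uint32_t event_for_pid(int32_t pid)
{
    /* Negate in 64 bits, then reduce modulo 2^32 (two's complement view). */
    uint32_t negated = (uint32_t)(-(int64_t)pid);

    /* 0xBFFFFFFF has every bit set except bit 30, so bit 30 is cleared. */
    return negated & UINT32_C(0xBFFFFFFF);
}
```

With this, event_for_pid(1234) gives 0xBFFFFB2E, so printing the pid on each
node should be enough to work out which event the blocked nrecv() is waiting
for.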

-- 
Mukuntan V Viswanathan
Visit My webpage at
www.buffalo.edu/~mvv2