I have an MPI application configured with one master and many slaves
that uses a single-program model. After some initialization the master
posts an MPI_IRECV() request for each slave and then arrives at the main
wait loop. This loop uses an MPI_WAITANY() to wait for one of the slaves
to send a message requesting some work. If work is available then the
master sends the information to the requesting slave and then posts
another MPI_IRECV() (using the same request array and index as before)
and goes back to the WAITANY(). If there is no more work to do, the
master sends a termination message to the slave so that it may quit and
the master returns to the WAITANY() call to wait for messages from
slaves that are still working.
When the master receives a message from one of the slaves, it checks the
tag. A certain tag value implies that the slave encountered some known
error condition and is going to quit. Before the slave exits, it sends
a message with the appropriate tag to the master. When the master
receives this, it takes note of it (so that it can report, at the end,
that some of the work did not finish) and goes back to the WAITANY
statement.
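For reference, the slave side follows the same protocol in mirror image.
This is not my actual slave code, but a simplified sketch of it would look
roughly like the following (the tag values, the subroutine name, and the
assumption that the master is rank 0 are only illustrative here;
MPI_FINALIZE is called by the main program after this routine returns):

      subroutine slave_loop
      include 'mpif.h'
c     Illustrative tag values only; the real parameters are defined
c     elsewhere.  The master is assumed to be rank 0.
      integer TAG_WORK_REQUEST,TAG_WORK_SEND,TAG_SLVERR
      parameter (TAG_WORK_REQUEST=1,TAG_WORK_SEND=2,TAG_SLVERR=3)
      integer eyenum,result,ierror_flag,ierr
      integer stats(MPI_STATUS_SIZE)
      result=0
00100 continue
c     Ask the master for work; this send satisfies the MPI_IRECV the
c     master has posted for this slave.  A positive value tells the
c     master that the previous job finished.
      call mpi_send(result,1,MPI_INTEGER,0,TAG_WORK_REQUEST,
     &              MPI_COMM_WORLD,ierr)
c     Receive either a work item or the termination value (-1).
      call mpi_recv(eyenum,1,MPI_INTEGER,0,TAG_WORK_SEND,
     &              MPI_COMM_WORLD,stats,ierr)
      if(eyenum.lt.0) return
c     ... do the work for item eyenum; set ierror_flag on failure ...
      ierror_flag=0
      if(ierror_flag.ne.0)then
c        Report the known error condition to the master and quit.
         call mpi_send(result,1,MPI_INTEGER,0,TAG_SLVERR,
     &                 MPI_COMM_WORLD,ierr)
         return
      endif
      result=1
      goto 100
      end
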
With no errors on the slave end, this runs fine with no problems.
However, to exercise the error handling, I forced one of the slaves to
encounter one of these error conditions. The process works, except that
I made the following observation. After the slave exits (it does call
MPI_FINALIZE), the master process starts sucking up CPU time as if it
is continuously running. The remaining slaves continue to do their work
and communicate with the master until the work left to be done is
finished, but the master seems to be running full blast. Previously,
with no simulated error, the master was basically sitting idle in the
MPI_WAITANY() call.
After some pondering, I thought the problem might be that, after the
error, the request array passed to MPI_WAITANY contains an invalid
request from the slave that exited earlier. However, the slave actually
fulfills the request initially posted, and my understanding is that the
request handle then gets set to MPI_REQUEST_NULL, so it should not
interfere when the WAITANY is called again.
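One way I could double-check that assumption is with a small diagnostic
called right after the MPI_WAITANY, something like the sketch below
(check_requests is just an illustrative name, not part of my real code);
it verifies that the completed handle was reset and counts how many
requests are still active:

      subroutine check_requests(njobs,request,iproc)
      include 'mpif.h'
      integer njobs,request(*),iproc
      integer i,nactive
c     Diagnostic only: the handle MPI_WAITANY just completed should
c     have been reset to MPI_REQUEST_NULL by the library.
      if(request(iproc).ne.MPI_REQUEST_NULL)then
         write(*,*) 'request ',iproc,' not reset to null'
      endif
c     Count the requests that are still outstanding.
      nactive=0
      do 100 i=1,njobs
         if(request(i).ne.MPI_REQUEST_NULL) nactive=nactive+1
00100 continue
      write(*,*) 'active requests remaining: ',nactive
      return
      end
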
Any insight into why the master suddenly starts eating CPU cycles
continuously would be appreciated. I've provided a snippet of code for
the section described above. Before the subroutine below is called, the
request array contains the request handles from the MPI_IRECV posted to
each slave.
==========================================================================
      subroutine work_loop(neye,eyeprm,job_info,request)
      include 'mpif.h'
      integer neye,eyeprm(*),job_info(*),request(*)
      integer slvterm,eyenum,iproc,ierr
      integer stats(MPI_STATUS_SIZE)
      ncmplt=0
      nsubmit=0
      eyenum=eyeprm(1)
      slvterm=-1
00200 continue
      if(ncmplt.lt.neye)then
c        Wait for one of the slaves to signal that it has finished
c        with its current work.
         call mpi_waitany(njobs,request,iproc,stats,ierr)
         itag=stats(MPI_TAG)
         if(itag.eq.TAG_WORK_REQUEST)then
            if(job_info(iproc).gt.0)then
               ncmplt=ncmplt+1
            endif
c           See if there is any more work to submit.
            if(nsubmit.lt.neye)then
               call mpi_send(eyenum,1,MPI_INTEGER,iproc,TAG_WORK_SEND,
     &              MPI_COMM_WORLD,ierr)
               nsubmit=nsubmit+1
               call mpi_irecv(job_info(iproc),1,MPI_INTEGER,iproc,
     &              MPI_ANY_TAG,MPI_COMM_WORLD,request(iproc),ierr)
c           If not, then tell the slave to quit.
            else
               call mpi_send(slvterm,1,MPI_INTEGER,iproc,TAG_WORK_SEND,
     &              MPI_COMM_WORLD,ierr)
            endif
            goto 200
         elseif(itag.eq.TAG_SLVERR)then
            nslverr=nslverr+1
            ncmplt=ncmplt+1
            goto 200
         endif
      endif
      return
      end
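
For completeness, the initialization mentioned above (posting one
MPI_IRECV per slave before work_loop is entered) looks roughly like this
in simplified form; the subroutine name is illustrative, and the sketch
assumes the master is rank 0 with slave i at rank i, which is how the
waitany loop uses the returned index:

      subroutine post_initial_recvs(nslaves,job_info,request)
      include 'mpif.h'
      integer nslaves,job_info(*),request(*)
      integer iproc,ierr
c     Post one MPI_IRECV per slave so that work_loop can wait on any
c     of them.  Slave ranks are assumed to run from 1 to nslaves,
c     with the master at rank 0, so the request index matches the
c     slave rank (work_loop relies on this when it replies).
      do 100 iproc=1,nslaves
         call mpi_irecv(job_info(iproc),1,MPI_INTEGER,iproc,
     &        MPI_ANY_TAG,MPI_COMM_WORLD,request(iproc),ierr)
00100 continue
      return
      end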