
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-10-04 15:46:17


A few clarifications:

- Are you running on SMPs? I.e., is your master running on the same
node as at least one slave, and do you have slaves running on other
nodes, too?
- What RPI are you using?
- Can you verify that your master is stuck in the 200 loop? I.e., it's
looping around the MPI_Waitany and the processing that follows it?

On Oct 3, 2005, at 12:02 PM, Douglas Vechinski wrote:

>
> I have an MPI application, configured with one master and many slaves,
> that uses a single-program model. After some initialization the master
> posts an MPI_IRECV() request for each slave and then arrives at the
> main wait loop. This loop uses an MPI_WAITANY() to wait for one of the
> slaves to send a message requesting some work. If work is available,
> the master sends the information to the requesting slave, posts
> another MPI_IRECV() (using the same request array and index as
> before), and goes back to the WAITANY(). If there is no more work to
> do, the master sends a termination message to the slave so that it may
> quit, and then returns to the WAITANY() to wait for messages from the
> slaves that are still working.
>
> When the master receives a message from one of the slaves, it checks
> the tag. A certain tag value implies that the slave encountered some
> known error condition and is going to quit. Before the slave exits, it
> sends the message with the appropriate tag to the master. When the
> master receives this, it takes note of it (so that it can report at
> the end that some of the work did not finish) and goes back to the
> WAITANY statement.
>
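(Just so I'm sure I follow the protocol, here is a minimal sketch of what
I assume the slave-side error path looks like; the tag name TAG_SLVERR is
taken from your snippet below, and errcode and the use of rank 0 as the
master are assumptions:)

      integer errcode,ierr
c     Slave hit a known error condition: notify the master and quit.
c     (errcode's value and rank 0 as the master are assumptions.)
      errcode=1
      call mpi_send(errcode,1,MPI_INTEGER,0,TAG_SLVERR,
     &              MPI_COMM_WORLD,ierr)
      call mpi_finalize(ierr)
      stop
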
> With no errors (on the slave end) this runs fine. However, to test the
> error handling, I forced one of the slaves to encounter one of these
> error conditions. The process works, except that I made the following
> observation. After the slave exits (it does call MPI_FINALIZE), the
> master process starts sucking up CPU time as if it is continuously
> running. The remaining slaves continue to do their work and
> communicate with the master until the remaining work is finished, but
> the master seems to be running full blast. Before, with no simulated
> error, the master was basically sitting idle because of the
> MPI_WAITANY() call.
>
> After some pondering, I thought the problem might be that the request
> array used in MPI_WAITANY contains an invalid request from the slave
> that exited earlier. However, that slave actually fulfills the request
> initially posted, and my understanding is that the request handle then
> gets set to MPI_REQUEST_NULL, so it should not interfere when the
> WAITANY is called again.
>
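(That understanding matches the MPI standard: MPI_WAITANY sets the
completed request to MPI_REQUEST_NULL, and null entries are ignored on
later calls. One related detail, and it is only a guess on my part
whether it matters here: if every entry in the array is
MPI_REQUEST_NULL, MPI_WAITANY returns immediately with the index set to
MPI_UNDEFINED and an empty status, so a loop that examines the tag
without checking for that case can spin. A minimal guard, using the
names from your snippet:)

      call mpi_waitany(njobs,request,iproc,stats,ierr)
c     When all requests are MPI_REQUEST_NULL, waitany returns at once
c     with iproc = MPI_UNDEFINED and an empty status, so the tag must
c     not be examined in that case.
      if(iproc.eq.MPI_UNDEFINED)then
         return
      endif
      itag=stats(MPI_TAG)
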
> Any insights into why the master suddenly starts eating CPU cycles
> continuously? I've provided a snippet of code for the section
> described above. Before the subroutine below is called, the request
> array contains the request handles for an MPI_IRECV posted to each
> slave.
>
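(For completeness, a minimal sketch of what I assume that initial
posting looks like; the names njobs, job_info, and request are taken
from the snippet, and the assumption that slave i is rank i is mine:)

c     Post one nonblocking receive per slave before entering the wait
c     loop; request(iproc) holds the handle that MPI_WAITANY completes.
      do 100 iproc=1,njobs
         call mpi_irecv(job_info(iproc),1,MPI_INTEGER,iproc,
     &        MPI_ANY_TAG,MPI_COMM_WORLD,request(iproc),ierr)
00100 continue
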
> ==========================================================================
>
>       subroutine work_loop(neye,eyeprm,job_info,request)
>
>       include 'mpif.h'
>
> c     njobs, nslverr, and the TAG_* parameters are assumed to be
> c     defined elsewhere (e.g. in an include file or common block).
>
>       integer neye,eyeprm(*),job_info(*),request(*)
>       integer slvterm,eyenum,iproc,ierr
>       integer stats(MPI_STATUS_SIZE)
>
>       ncmplt=0
>       nsubmit=0
>       eyenum=eyeprm(1)
>
>       slvterm=-1
>
> 00200 continue
>       if(ncmplt.lt.neye)then
>
> c        Wait for one of the slaves to signal that it has finished
> c        with its current work.
>
>          call mpi_waitany(njobs,request,iproc,stats,ierr)
>          itag=stats(MPI_TAG)
>          if(itag.eq.TAG_WORK_REQUEST)then
>             if(job_info(iproc).gt.0)then
>                ncmplt=ncmplt+1
>             endif
>
> c           See if there is any more work to submit.
>
>             if(nsubmit.lt.neye)then
>                call mpi_send(eyenum,1,MPI_INTEGER,iproc,TAG_WORK_SEND,
>      &              MPI_COMM_WORLD,ierr)
>                nsubmit=nsubmit+1
>                call mpi_irecv(job_info(iproc),1,MPI_INTEGER,iproc,
>      &              MPI_ANY_TAG,MPI_COMM_WORLD,request(iproc),ierr)
>
> c           If not, then tell the slave to quit.
>
>             else
>                call mpi_send(slvterm,1,MPI_INTEGER,iproc,TAG_WORK_SEND,
>      &              MPI_COMM_WORLD,ierr)
>             endif
>
>             goto 200
>          elseif(itag.eq.TAG_SLVERR)then
>             nslverr=nslverr+1
>             ncmplt=ncmplt+1
>             goto 200
>          endif
>       endif
>
>       return
>       end
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/