Yes, my test/development setup is two Linux boxes, one with two CPUs
and the other with a single (hyperthreading) CPU that "pretends" to
have two CPUs.
I'm not specifying any specific RPI, so I assume it defaults to
something appropriate.
After the simulated error in one of the slaves, the master is not stuck
in the 200 loop. It stays stuck inside the MPI_WAITANY call and is
sucking CPU cycles in there. I placed a write statement before and
after the MPI_WAITANY call to see whether it was running through the
200 loop continuously once it started eating CPU cycles. It is not.
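To be concrete, the instrumentation was roughly the following (just a
sketch, not the exact code; njobs, request, iproc and stats are the
same variables as in the snippet further down in the thread):

      write(*,*) 'before waitany'
      call mpi_waitany(njobs,request,iproc,stats,ierr)
      write(*,*) 'after waitany idx=',iproc,' tag=',stats(MPI_TAG)

The "before" message prints and the "after" message does not appear
again until another slave reports in, which is how I can tell the CPU
time is being spent inside MPI_WAITANY itself rather than in repeated
passes around the 200 loop.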
> On Tue, 2005-10-04 at 20:29 -0500, lam-request_at_[hidden] wrote:
> > A few clarifications:
> >
> > - Are you running on SMPs? I.e., is your master running on the same
> > node as at least one slave, and have slaves running on other nodes,
> > too?
> > - What RPI are you using?
> > - Can you verify that your master is stuck in the 200 loop? I.e., it's
> > looping around the MPI_Waitany and the processing that follows it?
> >
> >
> > On Oct 3, 2005, at 12:02 PM, Douglas Vechinski wrote:
> >
> > >
> > > I have an MPI application configured with one master and many slaves
> > > that uses a single-program model. After some initialization the master
> > > posts an MPI_IRECV() request for each slave and then arrives at the
> > > main wait loop. This loop uses MPI_WAITANY() to wait for one of the
> > > slaves to send a message requesting some work. If work is available,
> > > the master sends the information to the requesting slave, posts
> > > another MPI_IRECV() (using the same request array and index as
> > > before), and goes back to the WAITANY(). If there is no more work to
> > > do, the master sends a termination message to the slave so that it may
> > > quit, and the master returns to the WAITANY() for messages from slaves
> > > that are still working.
> > >
> > > When the master receives a message from one of the slaves, it checks
> > > the tag. A certain tag value implies that the slave encountered some
> > > known error condition and is going to quit. Before the slave exits, it
> > > sends the message with the appropriate tag to the master. When the
> > > master receives this, it takes note of it (so that it can report that
> > > some of the work did not finish at the end) and goes back to the
> > > WAITANY statement.
> > >
> > > With no errors (on the slave end) this runs fine with no problems.
> > > However, to simulate the error checking, I forced one of the slaves to
> > > encounter one of these error conditions. The process works, except
> > > that I made the following observation. After the slave exits (it does
> > > call MPI_FINALIZE), the master process starts sucking up cpu time as
> > > if it were continuously running. The remaining slaves continue to do
> > > their work and communicate with the master until the remaining work is
> > > finished, but the master seems to be running full blast. Before, with
> > > no simulated error, the master was basically sitting idle because of
> > > the MPI_WAITANY() call.
> > >
> > > After some pondering I thought the problem might be due to the request
> > > array used in MPI_WAITANY after the error containing an invalid request
> > > from the slave that exited earlier. However, the slave actually
> > > fulfills the request initially posted, and my understanding is that
> > > the request handle gets set to MPI_REQUEST_NULL, so it should not
> > > interfere when the WAITANY is called again.
> > >
> > > Any insights into why the master suddenly starts eating cpu cycles
> > > continuously? I've provided a snippet of code for the section
> > > described above. Before the subroutine below is called, the request
> > > array contains the request handles for an MPI_IRECV posted to each
> > > slave.
> > >
> > > ==========================================================================
> > >
> > > subroutine work_loop(neye,eyeprm,job_info,request)
> > >
> > > include 'mpif.h'
> > >
> > > integer neye,eyeprm(*),job_info(*),request(*)
> > > integer slvterm,eyenum,iproc,ierr
> > > integer stats(MPI_STATUS_SIZE)
> > >
> > > ncmplt=0
> > > nsubmit=0
> > > eyenum=eyeprm(1)
> > >
> > > slvterm=-1
> > >
> > > 00200 continue
> > > if(ncmplt.lt.neye)then
> > >
> > > c Wait for one of the slaves to signal that it has finished with
> > > c its current work.
> > >
> > > call mpi_waitany(njobs,request,iproc,stats,ierr)
> > > itag=stats(MPI_TAG)
> > > if(itag.eq.TAG_WORK_REQUEST)then
> > > if(job_info(iproc).gt.0)then
> > > ncmplt=ncmplt+1
> > > endif
> > >
> > > c See if there is any more work to submit.
> > >
> > > if(nsubmit.lt.neye)then
> > > call mpi_send(eyenum,1,MPI_INTEGER,iproc,TAG_WORK_SEND,
> > > & MPI_COMM_WORLD,ierr)
> > > nsubmit=nsubmit+1
> > > call mpi_irecv(job_info(iproc),1,MPI_INTEGER,iproc,
> > > & MPI_ANY_TAG,MPI_COMM_WORLD,request(iproc),ierr)
> > >
> > > c If not, then tell the slave to quit.
> > >
> > > else
> > > call mpi_send(slvterm,1,MPI_INTEGER,iproc,TAG_WORK_SEND,
> > > & MPI_COMM_WORLD,ierr)
> > > endif
> > >
> > > goto 200
> > > elseif(itag.eq.TAG_SLVERR)then
> > > nslverr=nslverr+1
> > > ncmplt=ncmplt+1
> > > goto 200
> > > endif
> > > endif
> > >
> > > return
> > >
> > > end
> > >
> > >
> > > _______________________________________________
> > > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> > >
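P.S. One way to double-check the MPI_REQUEST_NULL assumption above would
be to dump the request array just before each wait, e.g. (only a sketch,
using the same njobs/request as in the subroutine):

      do i=1,njobs
        if(request(i).eq.MPI_REQUEST_NULL)then
          write(*,*) 'request ',i,' is MPI_REQUEST_NULL'
        endif
      enddo

If the entry for the slave that exited shows up as MPI_REQUEST_NULL, then
the handle is being reset as expected, MPI_WAITANY is free to ignore it,
and the spinning would presumably be happening inside the library rather
than in my code.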