LAM/MPI General User's Mailing List Archives

From: Douglas Vechinski (douglas.vechinski_at_[hidden])
Date: 2005-08-18 08:54:58


I have a master process which spawns several slave processes. A small
amount of communication occurs between master and slave. When there is
no more work to be done, the master sends the slaves a message to quit.
When the slaves receive this message, they do some finishing up, and
then right before they call mpi_finalize, they send a single-valued
message back to the master.

After the master has sent the termination message to all the slaves, it
runs through a loop of mpi_irecv calls to receive the last message
from each slave and then calls mpi_waitall(). Once this call is
satisfied, the master does a few small finishing/tidying up things and
then quits.

Initially this all seemed to work ok, but later, when I added some
write statements to see what was going on, I started getting MPI
errors. At first I thought it might have something to do with the
slaves sending a message right before they quit, so that they were
already gone by the time the master attempted to receive it. But putting
some sleep statements in to force the slaves to linger didn't fix it.
The error seems to happen whenever I have a write statement in the
master.

Below is a small excerpt from the master (The lines in quotes represent
other stuff being done):

       "send termination messages to slaves"

       do i=1,njobs
           call mpi_irecv(job_info(i),1,MPI_INTEGER,0,110,
     & wcarray(i),request(i),ierr)
       enddo
c      write(*,*)'at waitall '
c      call sleep(10)
       call mpi_waitall(njobs,request,MPI_STATUSES_IGNORE)

      "do some final stuff"

      call mpi_finalize()

Here is a piece from the slaves:

c Wait for the termination signal from the master.

      call mpi_recv(k,1,MPI_INTEGER,0,101,parent,stats,ierr)

      "do a few small things"

      junk=0
      call mpi_send(junk,1,MPI_INTEGER,myrank,110,parent,ierr)

      call mpi_finalize(ierr)

      write(*,*)'Slave #',jobnum, ': stopping '
      stop

This seems to run ok. But if I then uncomment the write(*,*)'at waitall '
statement in the master, I get an MPI error that says
"MPI process rank 0 (n0, p2172) caught a SIGSEGV." for the master
process.

As I said, I was trying to see whether the problem had anything to do
with the master attempting to receive the last message after the slave
process had completely finished, or with the master posting the
receives before the slaves had sent their messages. I placed sleep
statements in various places, but whenever I added a write statement to
show where the code was at a certain point, I got the SIGSEGV error.

I don't plan on leaving the write statement in there, but the fact
that the code bombs when it is there makes me wonder whether there is
some other problem lurking somewhere. Any suggestions?

I was performing these tests on a dual-processor Linux machine with LAM
6.5.9.