LAM/MPI General User's Mailing List Archives

From: Douglas Vechinski (douglas.vechinski_at_[hidden])
Date: 2005-08-25 14:47:32


I posted a message a week ago about a problem I am having but didn't
get any responses, so I'm posting it again with an example that
reproduces it. Below is a reiteration of the problem. I've attached two
small Fortran codes which demonstrate the problem (at least for me).
The first, mst.f, is the master, which spawns a couple of slave
processes (slv.f). Make sure the slave executable is called 'slv',
since this is the name used to spawn it. The master is set up to spawn
two slaves. Execute the master with only one process:

mpirun -np 1 mst

As is, the master bombs right near the end. If you comment out line #91
(c write(*,*)'at waitall ') and recompile and run, it appears to
run fine. I'm seeing this problem with LAM 6.5.9 on a Mandrake 8.1
machine and with LAM 7.1.1 on another machine running RedHat Fedora 3.

Below is my description of the problem from an earlier post.

----------------------------------------------------------------

I have a master process which spawns several slave processes. A small
amount of communication occurs between master and slave. When there is
no more work to be done, the master sends the slaves a message to quit.
When the slaves receive this message, they do some finishing up, and
then right before they call mpi_finalize, they send a single valued
message back to the master.

After the master has sent the termination message to all the slaves, it
runs through a loop of several mpi_irecv calls to receive the last
message from the slaves and then calls mpi_waitall(). Once this call is
satisfied, the master does a few small finishing/tidying up things and
then quits.

Initially this all seemed to work OK, but later, when I added some
write statements to see what was going on, I started getting MPI
errors. I eventually thought it might have something to do with the
slaves sending a message right before they quit, so that by the time
the master attempted to receive the last message they were already
gone. But putting in some sleep statements to force the slaves to
linger around didn't fix it. It seems to happen whenever I have a write
statement in the master.

Below is a small excerpt from the master (The lines in quotes represent
other stuff being done):

       "send termination messages to slaves"

       do i=1,njobs
           call mpi_irecv(job_info(i),1,MPI_INTEGER,0,110,
     & wcarray(i),request(i),ierr)
       enddo
c      write(*,*)'at waitall '
c      call sleep(10)
       call mpi_waitall(njobs,request,MPI_STATUSES_IGNORE)

      "do some final stuff"

      call mpi_finalize()
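
For readers without the attachments, here is a minimal self-contained
sketch of the master side of this shutdown sequence. It is not the
attached mst.f. It assumes each slave is spawned on its own with
mpi_comm_spawn, so that the slave is rank 0 of its own
intercommunicator wcarray(i), and it assumes an MPI-2 capable
installation (for mpi_comm_spawn and MPI_STATUSES_IGNORE). Every call
is written with the trailing ierr argument that the Fortran bindings
take.

```fortran
      program mst
c     Sketch of the master side only; not the attached mst.f.
      implicit none
      include 'mpif.h'
c     Two slaves, as in the description above.
      integer njobs
      parameter (njobs=2)
      integer wcarray(njobs), request(njobs), job_info(njobs)
      integer i, k, ierr

      call mpi_init(ierr)

c     Assumption: one mpi_comm_spawn call per slave, so each slave
c     is rank 0 in the remote group of its own intercommunicator.
      do i=1,njobs
         call mpi_comm_spawn('slv',MPI_ARGV_NULL,1,MPI_INFO_NULL,0,
     &        MPI_COMM_SELF,wcarray(i),MPI_ERRCODES_IGNORE,ierr)
      enddo

c     Send the termination message (tag 101) to every slave.
      k=0
      do i=1,njobs
         call mpi_send(k,1,MPI_INTEGER,0,101,wcarray(i),ierr)
      enddo

c     Post one nonblocking receive per slave for its last message
c     (tag 110), then wait for all of them to complete.
      do i=1,njobs
         call mpi_irecv(job_info(i),1,MPI_INTEGER,0,110,
     &        wcarray(i),request(i),ierr)
      enddo
      call mpi_waitall(njobs,request,MPI_STATUSES_IGNORE,ierr)

      call mpi_finalize(ierr)
      end
```

Built with the MPI Fortran wrapper compiler, this would be started the
same way as described above, with a single master process
(mpirun -np 1 mst) and an executable named 'slv' available to spawn.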

Here is a piece from the slaves:

c Wait for the termination signal from the master.

      call mpi_recv(k,1,MPI_INTEGER,0,101,parent,stats,ierr)

      "do a few small things"

      junk=0
      call mpi_send(junk,1,MPI_INTEGER,myrank,110,parent,ierr)

      call mpi_finalize(ierr)

      write(*,*)'Slave #',jobnum, ': stopping '
      stop
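
As a self-contained counterpart for readers without the attachments, a
minimal sketch of the slave side might look like the following. It is
not the attached slv.f. It assumes the slave was started through
mpi_comm_spawn by a single master process, so the master is rank 0 in
the remote group of the parent intercommunicator, and the rank 0 used
as the send destination here is that assumption written out. The guard
for a missing parent is an addition for illustration.

```fortran
      program slv
c     Sketch of the slave side only; not the attached slv.f.
      implicit none
      include 'mpif.h'
      integer parent, k, junk, ierr
      integer stats(MPI_STATUS_SIZE)

      call mpi_init(ierr)

c     Intercommunicator back to the process that spawned this one.
      call mpi_comm_get_parent(parent,ierr)
      if (parent .eq. MPI_COMM_NULL) then
c        Started directly rather than spawned: nothing to do.
         write(*,*) 'slv: no parent, exiting'
         call mpi_finalize(ierr)
         stop
      endif

c     Wait for the termination message (tag 101) from the master,
c     assumed to be rank 0 of the remote group.
      call mpi_recv(k,1,MPI_INTEGER,0,101,parent,stats,ierr)

c     Send the single-valued last message (tag 110) to the master.
      junk=0
      call mpi_send(junk,1,MPI_INTEGER,0,110,parent,ierr)

      call mpi_finalize(ierr)
      end
```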

With the write statement in the master commented out, as above, this
seems to run OK. But if I then uncomment the "write(*,*)'at waitall '"
statement in the master, I get an MPI error that says
"MPI process rank 0 (n0, p2172) caught a SIGSEGV." for the master
process.

As I said, I was trying to see if the problem had anything to do with
the master attempting to receive the last message from a slave that
had already finished completely, or with the master posting the
receive calls before the slaves had sent their messages. I placed sleep
statements in various places, but whenever I added write statements to
show when the code reached a certain point, I got the SIGSEGV error.

I don't plan on leaving the write statement in there, but the fact
that it bombs when it is in there makes me wonder if there is
some other problem lurking somewhere. Any suggestions?

-----------------------------------------------------------------------