I apologize for not replying earlier -- too many deadlines these days.
:-\
I think you forgot the ierr argument in your call to MPI_WAITALL. This
is a common cause of segvs in Fortran applications (missing function
arguments, leading to bizarre side effects such as yours [works "fine"
unless you add the write statement]).
Can you try adding that ", ierr" and see if that makes the problem go
away?
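
For reference, here is a minimal sketch of what the corrected call might
look like. It is not your mst.f: the program name, maxjobs, and the null
requests are placeholders I made up so the example stands alone, and I am
assuming your MPI provides MPI_STATUSES_IGNORE (your excerpt already uses
it). The mpi_finalize() in your master excerpt looks like it is missing
its ierr as well, so the sketch passes one there too.

```fortran
c     Minimal sketch: every Fortran MPI call ends with an ierr argument.
c     maxjobs and the null requests are placeholders, not from mst.f.
      program waitall_sketch
      implicit none
      include 'mpif.h'
      integer maxjobs
      parameter (maxjobs = 2)
      integer request(maxjobs)
      integer i, ierr

      call mpi_init(ierr)

c     Stand-in for the mpi_irecv loop: null requests complete at once.
      do i = 1, maxjobs
         request(i) = MPI_REQUEST_NULL
      enddo

c     Note the trailing ierr. Without it, the library stores its
c     return code through an argument that was never passed.
      call mpi_waitall(maxjobs, request, MPI_STATUSES_IGNORE, ierr)

      call mpi_finalize(ierr)
      end
```

Fortran 77 does not check argument counts on external calls, so the
compiler will not warn you about this; the damage only shows up at run
time, and where it shows up can shift when you add or remove unrelated
statements such as a write.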
On Aug 25, 2005, at 3:47 PM, Douglas Vechinski wrote:
> I posted a message a week ago about a problem I am having but didn't
> hear responses. So I'm posting it again with an example that is giving
> me problems. Below is a reiteration of the problem. I've attached two
> small Fortran codes which demonstrate the problem (at least for me).
> The first mst.f is the master which spawns a couple of slave processes
> (slv.f). Make sure the slave executable is called 'slv' since this is
> the name that is used to spawn it off with. The master is set up to
> spawn two slaves off. Execute the master with only one process
>
> mpirun -np 1 mst
>
> As is, the master bombs right near the end. If you comment out line #91
> (c write(*,*)'at waitall ') and recompile and run, it appears to
> run fine. I'm having this problem with LAM 6.5.9 on a Mandrake 8.1
> machine and another machine with LAM 7.1.1 with RedHat Fedora 3.
>
> Below is my description of the problem from an earlier post.
>
> ----------------------------------------------------------------
>
>
> I have a master process which spawns several slave processes. A small
> amount of communication occurs between master and slave. When there is
> no more work to be done, the master sends the slaves a message to quit.
> When the slaves receive this message, they do some finishing up, and
> then right before they call mpi_finalize, they send a single valued
> message back to the master.
>
>
> After the master has sent the termination message to all the slaves, it
> runs through a loop of several mpi_irecv calls to receive the last
> message from the slaves and then calls mpi_waitall(). Once this call is
> satisfied, the master does a few small finishing/tidying up things and
> then quits.
>
>
> Initially this all seemed to work ok, but then later when I added some
> write statements to see what was going on, I started getting some MPI
> errors. I eventually thought it might have something to do with the
> slaves sending a message right before they quit, so that when the master
> was attempting to receive the last message they were done. But putting
> some sleep statements in to force the slaves to linger around didn't fix
> it. It seems to happen when I have a write statement in the master.
>
>
> Below is a small excerpt from the master (The lines in quotes represent
> other stuff being done):
>
>
> "send termination messages to slaves"
>
>
> do i=1,njobs
> call mpi_irecv(job_info(i),1,MPI_INTEGER,0,110,
> & wcarray(i),request(i),ierr)
> enddo
> c write(*,*)'at waitall '
> c call sleep(10)
> call mpi_waitall(njobs,request,MPI_STATUSES_IGNORE)
>
>
> "do some final stuff"
>
>
> call mpi_finalize()
>
>
> Here is a piece from the slaves:
>
>
> c Wait for the termination signal from the master.
>
>
> call mpi_recv(k,1,MPI_INTEGER,0,101,parent,stats,ierr)
>
>
> "do a few small things"
>
>
> junk=0
> call mpi_send(junk,1,MPI_INTEGER,myrank,110,parent,ierr)
>
>
> call mpi_finalize(ierr)
>
>
> write(*,*)'Slave #',jobnum, ': stopping '
> stop
>
>
> This seems to run ok. But if I then uncomment the "write(*,*)'at
> waitall '" statement in the master, I get an MPI error that says
> "MPI process rank 0 (n0, p2172) caught a SIGSEGV." for the master
> process.
>
>
> As I said, I was trying to see if the problem had anything to do with
> the master attempting to receive the last message from the slave, but
> the slave process was totally finished, or if the master was placing
> the receive calls before the slaves may have sent them. I placed sleep
> statements in various places, but whenever I put write statements to
> show where a code was at a certain point I would get the SIGSEGV error.
>
>
> While I don't plan on leaving the write statement in there, the fact
> that it is bombing when it is in there makes me wonder if there is
> some other problem lurking somewhere. Any suggestions?
>
> -----------------------------------------------------------------------
>
>
>
>
>
>
> <mst.f><slv.f>_______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/