LAM/MPI General User's Mailing List Archives

From: Andrew Friedley (afriedle_at_[hidden])
Date: 2006-03-15 11:10:44


Fabrizio Bisetti wrote:
> Hi all,
>
> I'm having run time troubles with a code of ours. Please find attached
> the output of laminfo. Unfortunately I couldn't locate the configure.log
> as I didn't install lam-mpi myself.
>
> The Communication Scheme
> The code is a PDE solver. Domain decomposition is applied in 1
> direction only. Since we are exchanging the ghost cells at the
> boundaries, this is the communication scheme (say for 3 procs)
>
> PE0 --> PE1 (update ghosts on 1)
> Barrier
> PE1 --> PE2 (update ghosts on 2)
> Barrier
> PE2 --> PE1 (update ghosts on 1)
> Barrier
> PE1 --> PE0 (update ghosts on 0)
> Barrier
>
> This is done in the subroutine exchangeParticles.f90 which I'm
> attaching. For each communication, the size of each bundle is around 18 MB.
>
> The Problem
> The crash is due to one of the procs being proclaimed dead while in
> receive and the other in send. The communication locks when the PE in
> the middle (PE1 in this case) sends to the last PE (PE2 for us).
>
> Strange facts
> - We experience the crash w/ 3 processors but *not* w/ 2 processors.
> - Changing optimization flags doesn't change the problem
> - It happens randomly for different runs, but for a given
> run, always at the same time-step.
> - We experience the problem when running with a chemistry integration
> module that takes up some ~170 MB of RAM (a look-up table). When
> chemistry is turned off, the communication crash disappears.
>
> I've been trying everything. Any ideas?

I don't have a solution, but I do have some ideas. From the error below
(process in local group is dead) it *looks* like rank 2 is crashing.

What I'd try to do is use a debugger to figure out what's going on in
that rank. You'll need to compile your application with debugging
symbols at the very least, and it would help if you used a LAM build
with debugging enabled as well.
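
One common way to do that (I think it's roughly the trick the FAQ below
describes, but double-check there) is to park the suspect rank in a loop
early in the run, find its PID with ps, and attach gdb to it. A rough
sketch -- the routine name, the 'go' variable, and the sleep() call are
all my own, and sleep() is a compiler extension (g77/gfortran/ifort have
it):

SUBROUTINE wait_for_debugger(myrank, suspect)
   IMPLICIT NONE
   INTEGER, INTENT(IN) :: myrank, suspect
   INTEGER :: go
   go = 0
   ! only the rank you want to inspect is held here
   IF ( myrank == suspect ) THEN
      WRITE(*,*) 'rank ', myrank, ' waiting for a debugger to attach'
      DO WHILE ( go == 0 )
         CALL sleep(1)   ! attach gdb, then "set var go = 1" and "continue"
      END DO
   END IF
END SUBROUTINE wait_for_debugger

Compile with -g at -O0 so the loop and the variable aren't optimized
away; once gdb is attached you can release the rank and step up to the
mpi_recv.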

Due to the conditions under which the hang occurs, I'm wondering if some
of the parameters being passed into the mpi_recv at rank 2 between barriers
1 and 2 are incorrect or bad somehow. Can you make sure the data is what
you expect it to be?
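
For instance, right after that mpi_recv you could check that the message
you received is the size you think it is (mpi_get_count is standard MPI;
'nrecvd' is just a name I made up):

INTEGER :: nrecvd
CALL mpi_get_count(STATUS, mpi_real, nrecvd, ier)
IF ( nrecvd /= ghostbundlelength ) THEN
   WRITE(*,*) 'pe ', mype, ': expected ', ghostbundlelength, &
              ' reals, got ', nrecvd
END IF

It would also be worth printing ghostbundlelength and
maindirstartmypeghost on each rank, to be sure the slice of particles you
are receiving into actually fits inside the array.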

Here are some useful FAQ entries for debugging, in case you don't know
about these tricks:

http://www.lam-mpi.org/faq/category6.php3

Question 5 would be useful for debugging what is going on in rank 2.

http://www.lam-mpi.org/faq/category2.php3

Question 15 here covers how to build LAM with debugging enabled.

It would be more useful on my end if you could provide a simplified,
complete program that demonstrates the error, if possible. The code
below is enough to give ideas, but I can't really debug anything with
only this subroutine.
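
Something along these lines is what I have in mind -- the array size and
the even/odd logic are only my loose guesses at your setup (I'm assuming
downProc is mype+1 and upProc is mype-1), so adjust it until it
reproduces the hang:

PROGRAM exchange_repro
   IMPLICIT NONE
   INCLUDE 'mpif.h'
   ! roughly 18 MB of 4-byte reals, like your ghost bundles
   INTEGER, PARAMETER :: n = 4500000
   REAL, ALLOCATABLE :: buf(:)
   INTEGER :: ier, mype, npes, genr, STATUS(mpi_status_size)

   CALL mpi_init(ier)
   CALL mpi_comm_rank(mpi_comm_world, mype, ier)
   CALL mpi_comm_size(mpi_comm_world, npes, ier)
   genr = MOD(mype, 2)
   ALLOCATE(buf(n))
   buf = REAL(mype)

   ! (1) even --> odd, downstream
   IF ( ( genr == 0 ) .AND. ( mype /= npes-1 ) ) THEN
      CALL mpi_send(buf, n, mpi_real, mype+1, 0, mpi_comm_world, ier)
   ELSE IF ( genr == 1 ) THEN
      CALL mpi_recv(buf, n, mpi_real, mype-1, 0, mpi_comm_world, STATUS, ier)
   END IF
   CALL mpi_barrier(mpi_comm_world, ier)

   ! (2) odd --> even, downstream (the step that hangs for you)
   IF ( ( genr == 1 ) .AND. ( mype /= npes-1 ) ) THEN
      CALL mpi_send(buf, n, mpi_real, mype+1, 0, mpi_comm_world, ier)
   ELSE IF ( ( genr == 0 ) .AND. ( mype /= 0 ) ) THEN
      CALL mpi_recv(buf, n, mpi_real, mype-1, 0, mpi_comm_world, STATUS, ier)
   END IF
   CALL mpi_barrier(mpi_comm_world, ier)

   WRITE(*,*) 'pe ', mype, ' finished the exchange'
   DEALLOCATE(buf)
   CALL mpi_finalize(ier)
END PROGRAM exchange_repro

If a stripped-down program like that runs cleanly on 3 procs, the next
step would be to add a dummy ~170 MB allocation to mimic the chemistry
table, since you only see the crash when that module is loaded.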

Andrew

> ------------------------------------------------------------------------
>
> odd pe 1 receiving from even pe 0
> pe 2 hit barrier 1
> sending from even pe 0 to odd pe 1
> sent from even pe 0 to odd pe 1
> pe 0 hit barrier 1
> odd pe 1 received from even pe 0
> pe 1 hit barrier 1
> pe 0 hit barrier 2
> sending from odd pe 1 to even pe 2
> even pe 2 receiving from odd pe 1
> MPI_Send: process in local group is dead (rank 1, MPI_COMM_WORLD)
> Rank (1, MPI_COMM_WORLD): Call stack within LAM:
> Rank (1, MPI_COMM_WORLD): - MPI_Send()
> Rank (1, MPI_COMM_WORLD): - main()
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 15284 failed on node n0 (10.0.0.3) with exit status 1.
> -----------------------------------------------------------------------------

> !***s* pdfmcCommunication/exchangeBoundaryParticles
>
> ! NAME
> ! exchangeBoundaryParticles -- To exchange the particles in the ghost
> ! cells
> ! USAGE
> ! call exchangeBoundaryParticles()
> ! PURPOSE
> ! Exchange the boundary particles to populate the ghost
> ! cells of adjacent procs with up-to-date data.
> ! If called on a one-proc network, it just returns, with little
> ! overhead.
> ! AUTHOR
> ! Fabrizio Bisetti (fbisetti .at. me .dot. berkeley .dot. edu)
> ! CREATION DATE
> ! Friday January 28, 2005
> ! INPUTS
> ! None
> ! OUTPUTS
> ! None
> ! USES
> ! pdfmcCommVars
> ! pdfmcMainVars
>
> !*****
> !----------------------------------------------------------------------
>
> SUBROUTINE exchangeboundaryparticles()
> !----------------------------------------------------------------------
> use pdfmccommvars
> use pdfmcmainvars
> !----------------------------------------------------------------------
> IMPLICIT NONE
> !----------------------------------------------------------------------
> INCLUDE 'mpif.h'
> !----------------------------------------------------------------------
> INTEGER :: ier, STATUS(mpi_status_size)
> !----------------------------------------------------------------------
>
> ! safety check: if only 1 proc, return
> IF ( npes == 1 ) RETURN
>
> ! (1) >>> send downstream (even) --> (odd)
> IF ( ( genr == 0 ) .AND. ( mype /= npes-1 ) ) THEN
>    !write(*,*) 'sending from even pe ',mype,' to odd pe ',downProc
>    CALL mpi_send(particles(1,1,1,1,maindirendmype-nrghosts+1), &
>                  ghostbundlelength, mpi_real, downproc, 0, comm, ier)
>    !write(*,*) 'sent from even pe ',mype,' to odd pe ',downProc
> END IF
> IF ( genr == 1 ) THEN
>    !write(*,*) 'odd pe ',mype,' receiving from even pe ',upProc
>    CALL mpi_recv(particles(1,1,1,1,maindirstartmypeghost), &
>                  ghostbundlelength, mpi_real, upproc, 0, comm, STATUS, ier)
>    !write(*,*) 'odd pe ',mype,' received from even pe ',upProc
> END IF
> !write(*,*) 'pe ',mype,' hit barrier 1'
> CALL mpi_barrier(comm,ier)
> ! (2) >>> send downstream (odd) --> (even)
> IF ( ( genr == 1 ) .AND. ( mype /= npes-1 ) ) THEN
>    !write(*,*) 'sending from odd pe ',mype,' to even pe ',downProc
>    CALL mpi_send(particles(1,1,1,1,maindirendmype-nrghosts+1), &
>                  ghostbundlelength, mpi_real, downproc, 0, comm, ier)
>    !write(*,*) 'sent from odd pe ',mype,' to even pe ',downProc
> END IF
> IF ( ( genr == 0 ) .AND. ( mype /= 0 ) ) THEN
>    !write(*,*) 'even pe ',mype,' receiving from odd pe ',upProc
>    CALL mpi_recv(particles(1,1,1,1,maindirstartmypeghost), &
>                  ghostbundlelength, mpi_real, upproc, 0, comm, STATUS, ier)
>    !write(*,*) 'even pe ',mype,' received from odd pe ',upProc
> END IF
> !write(*,*) 'pe ',mype,' hit barrier 2'
> CALL mpi_barrier(comm,ier)
> ! (3) >>> send upstream (even) --> (odd)
> IF ( ( genr == 0 ) .AND. ( mype /= 0 ) ) THEN
>    !write(*,*) 'sending from even pe ',mype,' to odd pe ',upProc
>    CALL mpi_send(particles(1,1,1,1,maindirstartmype), &
>                  ghostbundlelength, mpi_real, upproc, 0, comm, ier)
>    !write(*,*) 'sent from even pe ',mype,' to odd pe ',upProc
> END IF
> IF ( ( genr == 1 ) .AND. ( mype /= npes-1 ) ) THEN
>    !write(*,*) 'odd pe ',mype,' receiving from even pe ',downProc
>    CALL mpi_recv(particles(1,1,1,1,maindirendmype+1), &
>                  ghostbundlelength, mpi_real, downproc, 0, comm, STATUS, ier)
>    !write(*,*) 'odd pe ',mype,' received from even pe ',downProc
> END IF
> !write(*,*) 'pe ',mype,' hit barrier 3'
> CALL mpi_barrier(comm,ier)
> ! (4) >>> send upstream (odd) --> (even)
> IF ( ( genr == 1 ) .AND. ( mype /= 0 ) ) THEN
>    !write(*,*) 'sending from odd pe ',mype,' to even pe ',upProc
>    CALL mpi_send(particles(1,1,1,1,maindirstartmype), &
>                  ghostbundlelength, mpi_real, upproc, 0, comm, ier)
>    !write(*,*) 'sent from odd pe ',mype,' to even pe ',upProc
> END IF
> IF ( ( genr == 0 ) .AND. ( mype /= npes-1 ) ) THEN
>    !write(*,*) 'even pe ',mype,' receiving from odd pe ',downProc
>    CALL mpi_recv(particles(1,1,1,1,maindirendmype+1), &
>                  ghostbundlelength, mpi_real, downproc, 0, comm, STATUS, ier)
>    !write(*,*) 'even pe ',mype,' received from odd pe ',downProc
> END IF
> !write(*,*) 'pe ',mype,' hit barrier 4'
> CALL mpi_barrier(comm,ier)
>
> !----------------------------------------------------------------------
> END SUBROUTINE exchangeboundaryparticles
> !----------------------------------------------------------------------