LAM/MPI General User's Mailing List Archives

From: Fabrizio Bisetti (fbisetti_at_[hidden])
Date: 2006-03-14 16:14:50


Hi all,

I'm having run-time trouble with one of our codes. Please find attached
the output of laminfo. Unfortunately I couldn't locate the
configure.log, as I didn't install LAM/MPI myself.

The Communication Scheme
The code is a PDE solver. Domain decomposition is applied in 1
direction only. Since we are exchanging the ghost cells at the
boundaries, this is the communication scheme (say for 3 procs):

          PE0 --> PE1 (update ghosts on 1)
          Barrier
          PE1 --> PE2 (update ghosts on 2)
          Barrier
          PE2 --> PE1 (update ghosts on 1)
          Barrier
          PE1 --> PE0 (update ghosts on 0)
          Barrier

This is done in the subroutine in exchangeParticles.f90, which I'm
attaching. Each communication transfers a bundle of around 18 MB.
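
To make the ordering of the scheme above explicit, here is a rough sketch
that just generates the list of steps. It is in Python, not the Fortran of
the real code, and it makes no MPI calls. The function names
exchange_schedule and format_schedule are made up for illustration, and
the generalisation from 3 procs to N procs (a forward sweep followed by a
backward sweep) is an assumption based on the 3-proc listing.

```python
def exchange_schedule(nprocs):
    """Return the ordered steps of the 1-D ghost-cell exchange.

    Assumption: for N procs the pattern is a forward sweep
    (PE i -> PE i+1) followed by a backward sweep (PE i -> PE i-1),
    with a barrier after every single transfer.
    """
    steps = []
    # Forward sweep: each PE sends to its right-hand neighbour.
    for src in range(nprocs - 1):
        steps.append(("send", src, src + 1))
        steps.append(("barrier",))
    # Backward sweep: each PE sends to its left-hand neighbour.
    for src in range(nprocs - 1, 0, -1):
        steps.append(("send", src, src - 1))
        steps.append(("barrier",))
    return steps


def format_schedule(steps):
    """Render the steps in the same notation as the listing above."""
    lines = []
    for step in steps:
        if step[0] == "send":
            _, src, dst = step
            lines.append(f"PE{src} --> PE{dst} (update ghosts on {dst})")
        else:
            lines.append("Barrier")
    return lines


if __name__ == "__main__":
    print("\n".join(format_schedule(exchange_schedule(3))))
```

With nprocs = 2 the same generator yields only PE0 --> PE1 and
PE1 --> PE0, so there is no PE that first receives and then forwards to a
third one. That is the one structural difference from the 3-proc case.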

The Problem
The crash occurs when one of the procs is declared dead while it is in a
receive and the other is in a send. The communication hangs when the PE in
the middle (PE1 in this case) sends to the last PE (PE2 for us).

Strange facts
- We experience the crash w/ 3 processors but *not* w/ 2 processors.
- Changing optimization flags doesn't change the problem.
- The failing time-step differs between runs in no obvious pattern, but
   repeating a given run always fails at the same time-step.
- We experience the problem when running with a chemistry integration
   module that takes up ~170 MB of RAM (a look-up table). When
   chemistry is turned off, the communication crash disappears.

I've been trying everything. Any ideas?

Thanks in advance for your time/ideas!
Fabrizio