LAM/MPI General User's Mailing List Archives

From: jess michelsen (jam_at_[hidden])
Date: 2003-12-19 21:50:35


Hi everyone!

For some time I have successfully been using LAM/MPI 7.0.2 for
large-scale CFD calculations on a P4 cluster running RH 8.0.

The nodes are connected by ordinary managed gigabit switches. The NICs
are Intel PRO/1000 cards, and e1000 driver version 5.2.20 appears to be
both fast and stable: at most 64000 interrupts/sec, a 32 usec transmit
delay, and no flow control.
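
In modules.conf terms the relevant tuning looks roughly like the line
below (the option names are quoted from memory, so treat them as an
assumption rather than our literal settings):

  # /etc/modules.conf -- e1000 5.2.20 tuning; option names assumed,
  # please verify against the driver's README before copying.
  options e1000 InterruptThrottleRate=64000 TxIntDelay=32 FlowControl=0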

Just recently I increased the number of nodes in a job to 84. The job
then crashes after some iterations, always after about the same amount
of computation.

The CFD code is written in Fortran 95 and compiled with Intel 7.1.
There is a relatively large amount of communication going on during the
job. The iteration that fails is identical to those before it, which
succeeded, and the computed results up to that point are identical to
those obtained on a smaller set of nodes, so it seems safe to assume
this is not a Fortran bug.
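
Purely to illustrate the kind of per-iteration communication involved
(placeholder names, not the actual solver code), each rank does a
blocking exchange of boundary data with its neighbours, along these
lines:

  ! Illustrative sketch only -- placeholder names, not the real CFD code.
  ! Each rank swaps a strip of boundary data with its neighbours once per
  ! iteration; the failing call in the log below is a blocking MPI_RECV
  ! of this general kind.
  subroutine exchange_halo(sendbuf, recvbuf, n, left, right)
    implicit none
    include 'mpif.h'
    integer, intent(in)  :: n, left, right
    real(8), intent(in)  :: sendbuf(n)
    real(8), intent(out) :: recvbuf(n)
    integer :: ierr, status(MPI_STATUS_SIZE)

    ! Combined send/receive so the blocking calls cannot deadlock.
    call MPI_SENDRECV(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                      recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                      MPI_COMM_WORLD, status, ierr)
  end subroutine exchange_halo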

Just to make sure we are not involved with a single piece of bad
equipment, the job has been run on three different sets of
nodes/switches.

Does anybody have an idea, or has anyone experienced similar situations?

best regards, Jess Michelsen

The output from the job is:

 n= 1 t= 0.00100 log(res) 0.000 0.000 -1.424 -1.186
 n= 2 t= 0.00200 log(res) -1.441 -1.491 -2.191 -2.266
 n= 3 t= 0.00300 log(res) -1.358 -1.438 -2.096 -2.186
 n= 4 t= 0.00400 log(res) -2.489 -2.752 -2.803 -3.007
 n= 5 t= 0.00500 log(res) -2.713 -2.973 -3.715 -3.227
 n= 6 t= 0.00600 log(res) -2.972 -3.201 -4.235 -3.330
MPI_Recv: process in local group is dead (rank 12, MPI_COMM_WORLD)
MPI_Recv: process in local group is dead (rank 20, MPI_COMM_WORLD)
MPI_Recv: process in local group is dead (rank 36, MPI_COMM_WORLD)
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 13369 failed on node n0 (172.16.2.1) with exit status 1.
-----------------------------------------------------------------------------
MPI_Recv: process in local group is dead (rank 7, MPI_COMM_WORLD)
MPI_Recv: process in local group is dead (rank 11, MPI_COMM_WORLD)
MPI_Recv: process in local group is dead (rank 19, MPI_COMM_WORLD)

<snipped>
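
For completeness: the solver does not set its own error handler, so the
messages above appear to come from LAM's default MPI_ERRORS_ARE_FATAL
behaviour. A minimal sketch (placeholder names, not the actual code) of
how a receive could be made to return an error code instead:

  ! Sketch only -- switch MPI_COMM_WORLD to MPI_ERRORS_RETURN so a dead
  ! peer shows up as a nonzero ierr at the failing MPI_RECV instead of
  ! an abort from inside the library.
  program err_sketch
    implicit none
    include 'mpif.h'
    integer :: ierr, rank
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_ERRHANDLER_SET(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)
    ! ... per-iteration exchanges go here; after each MPI_RECV:
    ! if (ierr /= MPI_SUCCESS) write(*,*) 'rank', rank, ': recv error', ierr
    call MPI_FINALIZE(ierr)
  end program err_sketch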

The configuration was

./configure --with-rsh="ssh" --with-boot=tm --with-tm=/usr/pbs
--with-tcp-short=131072 --prefix=/usr/lam --without-profiling
--without-romio > configure.out

(i.e. configured to run under PBS, where the job also fails)
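
The PBS runs follow the usual LAM/tm pattern; a sketch of the job
script, with a placeholder resource line and paths rather than the
literal script, is:

  #!/bin/sh
  # Sketch of the PBS job script (resource request is a placeholder).
  #PBS -l nodes=84
  cd $PBS_O_WORKDIR
  # With the tm boot module, lamboot takes the node list from PBS,
  # so no hostfile is given here.
  lamboot -ssi boot tm
  mpirun C ./flowfield
  lamhalt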

For manual runs, the 84 nodes were lambooted with:

lamboot -v -b -ssi boot rsh -ssi rsh_agent "ssh" hostfile

and the job run by:

mpirun -ssi rpi tcp -np 84 flowfield
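
The hostfile itself is a plain LAM boot schema, one node per line,
along these lines (hostnames are placeholders):

  # LAM boot schema (hostfile): one line per node, 84 lines in total.
  # Hostnames below are placeholders.
  node01 cpu=1
  node02 cpu=1
  node03 cpu=1
  # ... and so on up to node84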