LAM/MPI General User's Mailing List Archives

From: jess michelsen (jam_at_[hidden])
Date: 2003-12-22 17:59:33


Hi Shashwat,

Running with memory checking and debugging (-C -g) traps no errors. In
fact, when running under the debugger, the code performs faster than the
optimized build. I have now reduced the interrupt throttle to 32000 and
increased the number of descriptors to 256 in order to avoid trouble at
a gather operation. This does not change the situation.
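
For reference, these throttle/descriptor settings are options of the e1000
kernel module; assuming the 5.2.20 parameter names (my assumption, so
verify against the driver's README), they would be set roughly like this,
e.g. in /etc/modules.conf on RH 8.0:

# assumed e1000 5.2.20 option names - check the driver README before use
options e1000 InterruptThrottleRate=32000 TxDescriptors=256 TxIntDelay=32 FlowControl=0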

I have then checked the output from top on all 84 nodes while the job is
running. One of the 84, node 22 in the lamboot set, is spending an
increasing amount of CPU time on hboot. These runs are performed without
PBSpro. At the stage at which the PBS jobs would crash, hboot is eating
up 99.9% CPU on node 22(!). Everyone else is then waiting for that node,
and during the wait PBSpro decides that the node is dead.
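
A loop along these lines (my sketch, assuming passwordless ssh and that
the hostfile lists one node name per line) is enough to watch hboot on
every node:

for h in `cat hostfile`; do
  echo "$h:"
  # show CPU usage and elapsed time of any hboot process on the node
  ssh $h "ps -C hboot -o pcpu,etime,args"
done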

I thought hboot would have finished at the time lamboot returned to the
prompt, i.e. before the job was started from the same prompt.

BTW: is it to be expected that lamboot of 84 nodes takes 15-30 minutes?
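
For completeness, a rough sketch of how the boot itself can be inspected
with the standard LAM tools (option placement here is my best
recollection, so double-check the man pages):

lamboot -d -v -ssi boot rsh -ssi rsh_agent "ssh" hostfile
lamnodes          # list the nodes that actually booted
tping -c 1 N      # one echo to every LAM daemon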

Best regards, Jess

On Sun, 2003-12-21 at 19:18, Shashwat Srivastav wrote:
> Hi,
>
> Seeing the output, it doesn't appear to me that it's safe to assume
> that it's not a Fortran bug. Can you run it through a memory-checking
> debugger and see what happens?
>
> Thanks.
> --
> Shashwat Srivastav
> LAM / MPI Developer (http://www.lam-mpi.org)
> Indiana University
> http://www.cs.indiana.edu/~ssrivast
>
> On Dec 19, 2003, at 9:50 PM, jess michelsen wrote:
>
> > Hi everyone!
> >
> > I have for some time successfully been using LAM-MPI 7.0.2 for
> > large-scale CFD calculations on a P4 cluster running RH 8.0.
> >
> > The nodes are connected by ordinary managed gigabit switches. NICs are
> > Intel 1000pro and the e1000 version is 5.2.20, which appears to be both
> > fast and stable. Max 64000 interrupts/sec and 32 usec delay for
> > transmit, no flow control.
> >
> > Just recently, I've increased the number of nodes in a job to 84. The
> > job then crashes after some iterations, always at about the same amount
> > of computation.
> >
> > The CFD code is a Fortran 95 code, compiled with Intel 7.1. There is a
> > relatively large amount of communication going on during the job. The
> > iteration that fails is identical to those before, which succeeded. The
> > computed results up to that point are identical to those obtained on a
> > smaller set of nodes. So it is safe to assume this is not a Fortran
> > bug.
> >
> > Just to make sure we are not involved with a single piece of bad
> > equipment, the job has been run on three different sets of
> > nodes/switches.
> >
> > Does anybody have an idea, or have you experienced similar situations?
> >
> > best regards, Jess Michelsen
> >
> >
> >
> > The output from the job is:
> >
> > n= 1 t= 0.00100 log(res) 0.000 0.000 -1.424 -1.186
> > n= 2 t= 0.00200 log(res) -1.441 -1.491 -2.191 -2.266
> > n= 3 t= 0.00300 log(res) -1.358 -1.438 -2.096 -2.186
> > n= 4 t= 0.00400 log(res) -2.489 -2.752 -2.803 -3.007
> > n= 5 t= 0.00500 log(res) -2.713 -2.973 -3.715 -3.227
> > n= 6 t= 0.00600 log(res) -2.972 -3.201 -4.235 -3.330
> > MPI_Recv: process in local group is dead (rank 12, MPI_COMM_WORLD)
> > MPI_Recv: process in local group is dead (rank 20, MPI_COMM_WORLD)
> > MPI_Recv: process in local group is dead (rank 36, MPI_COMM_WORLD)
> > -----------------------------------------------------------------------------
> > One of the processes started by mpirun has exited with a nonzero exit
> > code. This typically indicates that the process finished in error.
> > If your process did not finish in error, be sure to include a "return
> > 0" or "exit(0)" in your C code before exiting the application.
> >
> > PID 13369 failed on node n0 (172.16.2.1) with exit status 1.
> > -----------------------------------------------------------------------------
> > MPI_Recv: process in local group is dead (rank 7, MPI_COMM_WORLD)
> > MPI_Recv: process in local group is dead (rank 11, MPI_COMM_WORLD)
> > MPI_Recv: process in local group is dead (rank 19, MPI_COMM_WORLD)
> >
> > <snipped>
> >
> > The configuration was
> >
> > ./configure --with-rsh="ssh" --with-boot=tm --with-tm=/usr/pbs
> > --with-tcp-short=131072 --prefix=/usr/lam --without-profiling
> > --without-romio > configure.out
> >
> > (i.e. to run under PBS - which also fails)
> >
> > The 84 nodes were lambooted to be run manually by:
> >
> > lamboot -v -b -ssi boot rsh -ssi rsh_agent "ssh" hostfile
> >
> > and the job run by:
> >
> > mpirun -ssi rpi tcp -np 84 flowfield.
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/