LAM/MPI General User's Mailing List Archives

From: Shashwat Srivastav (ssrivast_at_[hidden])
Date: 2003-12-21 13:18:54


Hi,

Looking at the output, I don't think it is safe to assume that this is
not a Fortran bug. Can you run it through a memory-checking debugger
and see what happens?
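
For example, something like this should work with the tcp RPI (just a
sketch, assuming valgrind is installed in the same location on every
node; the exact options may differ with your valgrind version):

  mpirun -ssi rpi tcp -np 84 valgrind flowfield

Each rank then runs under valgrind, so out-of-bounds writes or use of
uninitialized memory in the application should show up in valgrind's
output before LAM reports a dead process.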

Thanks.

--
Shashwat Srivastav
LAM / MPI Developer (http://www.lam-mpi.org)
Indiana University
http://www.cs.indiana.edu/~ssrivast
On Dec 19, 2003, at 9:50 PM, jess michelsen wrote:
> Hi everyone!
>
> I have for some time successfully been using LAM-MPI 7.0.2 for
> large-scale CFD calculations on a P4 cluster running RH 8.0.
>
> The nodes are connected by ordinary managed gigabit switches. The NICs
> are Intel PRO/1000 and the e1000 driver version is 5.2.20, which appears
> to be both fast and stable (max 64000 interrupts/sec, 32 usec transmit
> delay, no flow control).
>
> Just recently, I've increased the number of nodes in a job to 84. The
> job then crashes after some iterations, always at about the same amount
> of computation.
>
> The CFD code is a Fortran 95 code, compiled with Intel 7.1. There is a
> relatively large amount of communication going on during the job. The
> iteration that fails is identical to those before it, which succeeded.
> The computed results up to that point are identical to those obtained
> on a smaller set of nodes. So it is safe to assume this is not a
> Fortran bug.
>
> Just to make sure a single piece of bad equipment is not involved, the
> job has been run on three different sets of nodes/switches.
>
> Does anybody have an idea, or have you experienced similar situations?
>
> best regards, Jess Michelsen
>
>
>
> The output from the job is:
>
>  n=    1 t=      0.00100 log(res)   0.000   0.000  -1.424  -1.186
>  n=    2 t=      0.00200 log(res)  -1.441  -1.491  -2.191  -2.266
>  n=    3 t=      0.00300 log(res)  -1.358  -1.438  -2.096  -2.186
>  n=    4 t=      0.00400 log(res)  -2.489  -2.752  -2.803  -3.007
>  n=    5 t=      0.00500 log(res)  -2.713  -2.973  -3.715  -3.227
>  n=    6 t=      0.00600 log(res)  -2.972  -3.201  -4.235  -3.330
> MPI_Recv: process in local group is dead (rank 12, MPI_COMM_WORLD)
> MPI_Recv: process in local group is dead (rank 20, MPI_COMM_WORLD)
> MPI_Recv: process in local group is dead (rank 36, MPI_COMM_WORLD)
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code.  This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 13369 failed on node n0 (172.16.2.1) with exit status 1.
> -----------------------------------------------------------------------------
> MPI_Recv: process in local group is dead (rank 7, MPI_COMM_WORLD)
> MPI_Recv: process in local group is dead (rank 11, MPI_COMM_WORLD)
> MPI_Recv: process in local group is dead (rank 19, MPI_COMM_WORLD)
>
> <snipped>
>
> The configuration was
>
> ./configure --with-rsh="ssh" --with-boot=tm --with-tm=/usr/pbs
> --with-tcp-short=131072 --prefix=/usr/lam --without-profiling
> --without-romio > configure.out
>
> (i.e. to run under PBS, which also fails)
>
> For manual runs, the 84 nodes were lambooted with:
>
> lamboot -v -b -ssi boot rsh -ssi rsh_agent "ssh" hostfile
>
> and the job run by:
>
> mpirun -ssi rpi tcp -np 84 flowfield.
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>