Hi,
Looking at the output, it doesn't seem safe to me to assume that this is
not a Fortran bug. Can you run it through a memory-checking debugger and
see what happens?
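For example, something along these lines should put every MPI process
under valgrind (just a sketch -- valgrind has to be installed on each
node, and the exact option for writing per-process log files differs
between valgrind versions, so check valgrind --help):

  # start each rank under valgrind; memory errors are reported on each
  # process's stderr (or in per-process log files, if requested)
  mpirun -ssi rpi tcp -np 84 valgrind flowfield

Since the code is Fortran 95 built with Intel 7.1, recompiling with the
compiler's array bounds checking (-CB, if I remember the ifc flag right)
and re-running a smaller case would also be a cheap way to rule out an
out-of-bounds write.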
Thanks.
--
Shashwat Srivastav
LAM / MPI Developer (http://www.lam-mpi.org)
Indiana University
http://www.cs.indiana.edu/~ssrivast
On Dec 19, 2003, at 9:50 PM, jess michelsen wrote:
> Hi everyone!
>
> I have for some time successfully been using LAM-MPI 7.0.2 for
> large-scale CFD calculations on a P4 cluster running RH 8.0.
>
> The nodes are connected by ordinary managed gigabit switches. NICs are
> Intel 1000pro and the e1000 version is 5.2.20 which appears to be both
> fast and stable. Max 64000 interrupts/sec and 32 usec delay for
> transmit, no flow control.
>
> Just recently, I've increased the number of nodes in a job to 84. The
> job then crashes after some iterations, always at about the same amount
> of computation.
>
> The CFD code is a Fortran 95 code, compiled with Intel 7.1. There is a
> relatively large amount of communication going on during the job. The
> iteration that fails is identical to those before it, which succeeded.
> The computed results up to that point are identical to those obtained
> on a smaller set of nodes. So it is safe to assume this is not a
> Fortran bug.
>
> Just to make sure a single piece of bad equipment is not involved, the
> job has been run on three different sets of nodes/switches.
>
> Does anybody have an idea, or have you experienced similar situations?
>
> best regards, Jess Michelsen
>
> The output from the job is:
>
> n= 1 t= 0.00100 log(res) 0.000 0.000 -1.424 -1.186
> n= 2 t= 0.00200 log(res) -1.441 -1.491 -2.191 -2.266
> n= 3 t= 0.00300 log(res) -1.358 -1.438 -2.096 -2.186
> n= 4 t= 0.00400 log(res) -2.489 -2.752 -2.803 -3.007
> n= 5 t= 0.00500 log(res) -2.713 -2.973 -3.715 -3.227
> n= 6 t= 0.00600 log(res) -2.972 -3.201 -4.235 -3.330
> MPI_Recv: process in local group is dead (rank 12, MPI_COMM_WORLD)
> MPI_Recv: process in local group is dead (rank 20, MPI_COMM_WORLD)
> MPI_Recv: process in local group is dead (rank 36, MPI_COMM_WORLD)
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 13369 failed on node n0 (172.16.2.1) with exit status 1.
> -----------------------------------------------------------------------------
> MPI_Recv: process in local group is dead (rank 7, MPI_COMM_WORLD)
> MPI_Recv: process in local group is dead (rank 11, MPI_COMM_WORLD)
> MPI_Recv: process in local group is dead (rank 19, MPI_COMM_WORLD)
>
> <snipped>
>
> The configuration was
>
> ./configure --with-rsh="ssh" --with-boot=tm --with-tm=/usr/pbs
> --with-tcp-short=131072 --prefix=/usr/lam --without-profiling
> --without-romio > configure.out
>
> (i.e. to run under PBS - which also fails)
>
> For manual runs, the 84 nodes were lambooted with:
>
> lamboot -v -b -ssi boot rsh -ssi rsh_agent "ssh" hostfile
>
> and the job run by:
>
> mpirun -ssi rpi tcp -np 84 flowfield.
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>