LAM/MPI General User's Mailing List Archives

From: Peter Beerli (beerli_at_[hidden])
Date: 2004-02-04 10:42:48


Hi there,

I have a program that dies on my cluster of 16 nodes (2 CPUs per
node). The same program runs fine on an IBM SP4 machine (on 44 CPUs,
using IBM's own MPI and LoadLeveler), and on small problems it runs
fine on my cluster, too. I might have a programming bug that shows up
only on large problems, but the crash happens at seemingly random places.

I noticed that after such a sudden death on a random node, lamclean
does not wipe off all processes, and neither does wipe -v lamhosts
[see the excerpts below]. It seems that LAM lost its list of hosts,
yet lamnodes still reports all nodes.
[I start the run with: mpirun -np 29 -O -s n0 migrate-mpiss
parmfile-short]

The home directory is NFS-mounted (and we have had difficulties with
that in the past), while the LAM home sits on a local directory on
every node (all nodes run the same version, LAM 7.0.3 compiled with
gcc 3.3).

Could it be that my cluster has a flaky setup that kills the connections
under heavy load? What is the easiest way to separate hardware problems
from problems in the program code or in LAM?
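
For what it's worth, one way to take my own code out of the picture
would be a trivial MPI ring test that just hammers the interconnect
under full load, something along the following lines (a minimal
sketch; the buffer size and iteration count are arbitrary). If this
also dies after a while when started with the same mpirun arguments,
the problem is below my program.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Minimal interconnect stress test: every rank repeatedly exchanges a
   large buffer with its ring neighbors.  If this survives for a long
   time under full load, the hardware/LAM layer is probably not at
   fault. */
int main(int argc, char **argv)
{
    int rank, size, i;
    const int len = 1 << 20;            /* 1 MB per message, arbitrary */
    const int iters = 10000;            /* arbitrary */
    char *sendbuf, *recvbuf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(len);
    recvbuf = malloc(len);
    memset(sendbuf, rank & 0xff, len);

    for (i = 0; i < iters; i++) {
        int right = (rank + 1) % size;
        int left  = (rank + size - 1) % size;
        MPI_Sendrecv(sendbuf, len, MPI_CHAR, right, 0,
                     recvbuf, len, MPI_CHAR, left, 0,
                     MPI_COMM_WORLD, &status);
        /* cheap integrity check on the received data */
        if (recvbuf[0] != (char)(left & 0xff)) {
            fprintf(stderr, "rank %d: corrupted message at iteration %d\n",
                    rank, i);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
    }

    if (rank == 0)
        printf("ring test passed (%d iterations)\n", iters);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}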

I am happy to supply more info. Any idea how to tackle this one? I
simply hope that I am doing something stupid, because that is easier
to fix than hardware issues.

Peter

<<<<<<<<<excerpts>>>>>>>>>>>>>>>>>>
........ program runs for about 3 minutes............
09:42:16 Equilibrate tree (first 10 trees are not used)
SEVERE ERROR: Segmentation fault on Node **5**
               this results in an non recoverable crash.
               But check the datatype and your infile for errors, too.
               Please report error with as much detail as possible to
               Peter Beerli <beerli@>
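
(Aside: a per-node report like the one above is typically produced by
a per-rank SIGSEGV handler that prints a message and then re-raises
the signal; re-raising would also explain why mpirun still reports
status 11 below while the other ranks keep running. A simplified
sketch of such a handler, not migrate's actual source:)

#include <mpi.h>
#include <signal.h>
#include <stdio.h>

static int my_rank = -1;

/* Print a per-rank crash report, then re-raise SIGSEGV with the
   default action so the process still dies with signal 11.
   (Strictly, only async-signal-safe calls belong in a handler;
   kept simple here for illustration.) */
static void segv_handler(int sig)
{
    fprintf(stderr, "SEVERE ERROR: Segmentation fault on Node %d\n", my_rank);
    signal(sig, SIG_DFL);
    raise(sig);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    signal(SIGSEGV, segv_handler);
    /* ... application work ... */
    MPI_Finalize();
    return 0;
}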

-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 22535 failed on node n5 (144.172.131.142) with exit status 11.
-----------------------------------------------------------------------------
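
(Aside: the "exit status 11" here presumably matches signal 11,
SIGSEGV, and the segfault report above, so the generic "return 0"
advice in the message does not apply. Below is a generic sketch of
how a launcher tells a clean exit from a signal death with the
waitpid() status macros; illustrative only, not LAM's actual source:)

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int status;
    pid_t pid = fork();
    if (pid == 0) {
        raise(SIGSEGV);   /* child: simulate the crash */
        return 0;
    }

    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("child exited normally, status %d\n", WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        printf("child killed by signal %d\n", WTERMSIG(status)); /* 11 = SIGSEGV */
    return 0;
}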
p017:test>clucom -lam 'ps auxw | grep migrate'
starting....
p017.> beerli 18621  0.0 0.0  2224   960 pts/2 S 09:43 0:00 sh -c ssh p017. ps auxw | grep migrate
p017.> beerli 18623  0.0 0.0  1760   604 pts/2 S 09:43 0:00 grep migrate
p016.> beerli 13232 81.9 0.8 21428 17740 ?     S 09:34 7:18 migrate-mpiss parmfile-short
p015.> beerli 21861 82.0 0.8 21932 18240 ?     S 09:34 7:18 migrate-mpiss parmfile-short
p014.> beerli 22630 81.7 0.8 21260 17476 ?     S 09:34 7:17 migrate-mpiss parmfile-short
p013.> beerli 22460 81.9 0.9 22672 18988 ?     S 09:34 7:18 migrate-mpiss parmfile-short
p009.> beerli 19346 81.7 0.8 22176 18384 ?     S 09:34 7:18 migrate-mpiss parmfile-short
p008.> beerli 20166 81.6 0.9 22432 18756 ?     S 09:34 7:17 migrate-mpiss parmfile-short
p006.> beerli 22577 81.4 0.8 22080 18292 ?     S 09:34 7:17 migrate-mpiss parmfile-short
p002.> beerli 12032 81.5 0.8 21608 17936 ?     S 09:34 7:17 migrate-mpiss parmfile-short
ending...
p017:test>lamclean
p017:test>clucom -lam 'ps auxw | grep migrate'
starting....
p017.> beerli 18682  1.0 0.0  2224   960 pts/2 S 09:44 0:00 sh -c ssh p017. ps auxw | grep migrate
p017.> beerli 18684  0.0 0.0  1760   600 pts/2 S 09:44 0:00 grep migrate
p016.> beerli 13232 77.8 0.8 21428 17740 ?     S 09:34 7:18 migrate-mpiss parmfile-short
p015.> beerli 21861 77.9 0.8 21932 18240 ?     S 09:34 7:18 migrate-mpiss parmfile-short
p014.> beerli 22630 77.7 0.8 21260 17476 ?     S 09:34 7:17 migrate-mpiss parmfile-short
p013.> beerli 22460 77.8 0.9 22672 18988 ?     S 09:34 7:18 migrate-mpiss parmfile-short
p009.> beerli 19346 77.6 0.8 22176 18384 ?     S 09:34 7:18 migrate-mpiss parmfile-short
p008.> beerli 20166 77.6 0.9 22432 18756 ?     S 09:34 7:17 migrate-mpiss parmfile-short
p006.> beerli 22577 77.3 0.8 22080 18292 ?     S 09:34 7:17 migrate-mpiss parmfile-short
p002.> beerli 12032 77.4 0.8 21608 17936 ?     S 09:34 7:17 migrate-mpiss parmfile-short
ending...