Hi,
I have LAM 7.1.1 running on a cluster of dual-G5 nodes under OS X.
On some nodes LAM works perfectly with the usysv, sysv, and tcp RPIs, but
on 4 nodes the usysv and sysv RPIs intermittently fail to start.
Sometimes they work, but more often they don't. None of the nodes is
heavily loaded (yet). The same problem occurs whether the job is
submitted through PBS (using the tm boot module) or with the rsh boot
module. Meanwhile, the tcp RPI always works on all nodes.
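In case it helps to isolate things: the application itself shouldn't matter,
since the failure is in MPI_INIT. A bare init/finalize test like the sketch
below (my own minimal program, compiled with mpicc, not the actual a.out I am
running) should show whether the failure really is independent of the
application code:

    /* Minimal sketch: only initializes and finalizes MPI, so any failure
       here points at RPI startup rather than at the application. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);    /* this is where the usysv/sysv RPIs die */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("rank %d of %d initialized OK\n", rank, size);
        MPI_Finalize();
        return 0;                  /* explicit return 0, as mpirun expects */
    }
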
Yesterday I re-installed LAM on all nodes, but that didn't help. I have
compared the mpi directories between good and bad nodes, and there is no
difference. I have compared the /etc/rc files (where I think the shared
memory sizes are set) between good and bad nodes, and there is also no
difference. I have also tried rebooting the offending nodes; that seemed
to work for a little while (4 or 5 jobs), but then the same problem came
back.
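The only other shared-memory-related things I know to check are the kernel's
SysV limits and any stale segments left behind by crashed jobs, along these
lines on a good and a bad node (assuming ipcs is installed; these are the
commands as I understand them on OS X, so apologies if the flags differ):

    # compare the SysV shared memory / semaphore limits between nodes
    sysctl -a | grep kern.sysv

    # look for leftover shared memory segments and semaphores from old jobs
    ipcs -m
    ipcs -s

    # a stale segment could then be removed by id, e.g.
    # ipcrm -m <id>

If the usysv RPI depends on particular values there, which ones should I be
comparing?
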
Does anyone know what the problem might be, or which other files might be
pertinent (so that I can compare them between good and bad nodes)?
Thanks in advance,
Sean Dettrick
x:~/test sdettrick$ lamboot -v -b -ssi boot rsh -ssi rpi usysv
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
n-1<22095> ssi:boot:base:linear: booting n0 (x.tae.cluster)
n-1<22095> ssi:boot:base:linear: booting n1 (x1.tae.cluster)
n-1<22095> ssi:boot:base:linear: booting n2 (x2.tae.cluster)
n-1<22095> ssi:boot:base:linear: booting n3 (x3.tae.cluster)
n-1<22095> ssi:boot:base:linear: finished
x:~/test sdettrick$ mpirun -wd ~/test -np 8 -ssi rpi usysv ./a.out
-----------------------------------------------------------------------------
The selected RPI failed to initialize during MPI_INIT. This is a
fatal error; I must abort.
This occurred on host x3.tae.cluster (n3).
The PID of failed process was 1099 (MPI_COMM_WORLD rank: 6)
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
The selected RPI failed to initialize during MPI_INIT. This is a
fatal error; I must abort.
This occurred on host x2.tae.cluster (n2).
The PID of failed process was 1113 (MPI_COMM_WORLD rank: 4)
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 22107 failed on node n0 (10.0.1.254) with exit status 1.
-----------------------------------------------------------------------------
x:~/test sdettrick$