I'm back.
I sent a message a few weeks ago asking whether anyone sees problems running
the Pallas benchmarks using sysv or usysv on dual-processor x86 boxes.
Attached you'll find two files: one is from
-ssi rpi usysv
and the other from
-ssi rpi sysv
Note that
-ssi coll lam_basic
seems to avoid the problem.
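In other words, adding that switch to the runs shown at the bottom, e.g.
$ mpirun C -np 6 -ssi rpi usysv -ssi coll lam_basic PMB-MPI1
(which forces the lam_basic collectives in place of the smp ones that show up
in the traces below), lets the benchmark finish.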
The sysv and usysv runs appear to be deadlocked in the same way, with slightly
different functions used for communication: usysv is using a spinlock while sysv uses a semop.
The problem seems to be that, for a pair of processes on one box, one is
doing a send and waiting on the spinlock while the other is blocked in a TCP
read.
Blocked on the spinlock, on 172.16.0.113:
Stack Trace 3.1
C spinlock, FP=bffff548
C lam_ssi_rpi_usysv_proc_read_env, FP=bffff578
C lam_ssi_rpi_usysv_advance_common, FP=bffff5b8
C lam_ssi_rpi_usysv_advance, FP=bffff5d8
C _mpi_req_advance, FP=bffff618
C lam_send, FP=bffff6d8
C PMPI_Send, FP=bffff708
C lam_ssi_coll_lam_basic_reduce_lin, FP=bffff758
C lam_ssi_coll_smp_reduce, FP=bffff788
C lam_ssi_coll_lam_basic_allreduce, FP=bffff7b8
C lam_ssi_coll_smp_allreduce, FP=bffff7e8
C MPI_Allreduce, FP=bffff828
Blocked in socket read(), on 172.16.0.113:
Stack Trace 2.1
read, FP=bffff578
C sread, FP=bffff578
C lam_ssi_rpi_tcp_proc_read_env, FP=bffff598
C lam_ssi_rpi_tcp_adv1, FP=bffff5b8
C lam_ssi_rpi_usysv_advance_common, FP=bffff5e8
C lam_ssi_rpi_usysv_advance, FP=bffff608
C _mpi_req_advance, FP=bffff648
C PMPI_Recv, FP=bffff708
C lam_ssi_coll_lam_basic_reduce_lin, FP=bffff758
C lam_ssi_coll_smp_reduce, FP=bffff788
C lam_ssi_coll_lam_basic_allreduce, FP=bffff7b8
C lam_ssi_coll_smp_allreduce, FP=bffff7e8
C MPI_Allreduce, FP=bffff828
It seems to me that there is a general "progress" problem here: one process is
stuck waiting for a TCP message to arrive while the other is stuck waiting
for its on-node neighbor to handle a shared-memory operation.
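Just to make that circular wait concrete, here is a toy sketch. This is NOT
the LAM code, just two pthreads standing in for the two on-node ranks, with a
flag standing in for the shared-memory slot and a pipe standing in for the TCP
socket. Compile with -lpthread; running it simply hangs, which is the point:

#include <pthread.h>
#include <unistd.h>

static volatile int slot_busy = 1; /* stands in for the usysv shared-memory slot/spinlock */
static int fake_tcp[2];            /* pipe standing in for the TCP socket */

static void *rank_a(void *arg)     /* the process stuck in spinlock() */
{
    char c = 'x';
    (void)arg;
    while (slot_busy)              /* spins forever: only rank_b would clear it */
        ;
    write(fake_tcp[1], &c, 1);     /* never reached */
    return NULL;
}

static void *rank_b(void *arg)     /* the process stuck in read() */
{
    char c;
    (void)arg;
    read(fake_tcp[0], &c, 1);      /* blocks forever: only rank_a would write */
    slot_busy = 0;                 /* never reached */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    if (pipe(fake_tcp) != 0)
        return 1;
    pthread_create(&a, NULL, rank_a, NULL);
    pthread_create(&b, NULL, rank_b, NULL);
    pthread_join(a, NULL);         /* never returns; the program just sits here */
    pthread_join(b, NULL);
    return 0;
}

Neither thread can make progress because each is waiting for a step that only
the other one will ever take.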
The machines I'm using are dual P-III 3.0 GHz boxes with hyperthreading turned off.
I've lambooted 10 nodes (20 processes), but am only using 6 processes for these
tests.
$ lamnodes
n0 172.16.0.113:2:
n1 172.16.0.114:2:
n2 172.16.0.115:2:
n3 172.16.0.116:2:
n4 172.16.0.117:2:
n5 172.16.0.118:2:
n6 172.16.0.119:2:
n7 172.16.0.120:2:
n8 172.16.0.121:2:
n9 172.16.0.122:2:origin,this_node
$ mpirun C -np 6 -ssi rpi sysv PMB-MPI1
or
$ mpirun C -np 6 -ssi rpi usysv PMB-MPI1
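In case it helps with reproducing this: the traces above both end in
MPI_Allreduce, so a stripped-down loop like the one below may be enough to
trigger it. This is my own sketch, not the actual PMB source (the real
benchmark does warm-up, timing, and more buffer sizes), run the same way as
the benchmark above, e.g.
$ mpirun C -np 6 -ssi rpi usysv ./allreduce_loop

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, count, iter, i;
    const int maxcount = 1 << 20;  /* up to 4 MB of floats; arbitrary cap */
    float *in, *out;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    in  = malloc(maxcount * sizeof(float));
    out = malloc(maxcount * sizeof(float));
    for (i = 0; i < maxcount; i++)
        in[i] = 1.0f;

    /* Repeated allreduces over doubling message sizes, roughly the shape
     * of the Pallas Allreduce test. */
    for (count = 1; count <= maxcount; count *= 2) {
        for (iter = 0; iter < 100; iter++)
            MPI_Allreduce(in, out, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("finished count = %d floats\n", count);
    }

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}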
Thanks for any help,
Bob Felderman