Hello,
I'm having trouble getting code to run in "c2c" mode.
The code runs intermittently and usually gets hung up
in a call to MPI_Recv. The network is 100 Mb/s and a lot
of data is being sent to the master node (however, each
worker node waits its turn before sending to the master).
I have attempted to rule out an MPI send/recv bottleneck
by replacing MPI_Send() with MPI_Ssend(). The code still works
fine in "-ssi rpi lamd" mode, and adding MPI_Barrier() calls
does not cause a problem in "lamd" mode. Valgrind did not find
any problems (and Electric Fence ran out of memory).
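For reference, the communication pattern looks roughly like the
sketch below. This is not the actual application code; the buffer
size, datatype, tag, and variable names are placeholders, and it
assumes rank 0 is the master and that "waiting its turn" means the
master simply receives from each worker in rank order:

/* Stripped-down sketch of the pattern: rank 0 receives one large
 * message from each worker in rank order, so each worker's send is
 * serialized by the master's matching receive. */
#include <stdlib.h>
#include <mpi.h>

#define NELEM (1 << 20)   /* placeholder message size (doubles) */
#define TAG   42          /* placeholder tag */

int main(int argc, char **argv)
{
    int rank, size;
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = calloc(NELEM, sizeof(double));

    if (rank == 0) {
        /* Master: take each worker's data in turn. */
        int src;
        MPI_Status status;
        for (src = 1; src < size; src++) {
            MPI_Recv(buf, NELEM, MPI_DOUBLE, src, TAG,
                     MPI_COMM_WORLD, &status);  /* hangs here in c2c mode */
        }
    } else {
        /* Worker: send once; MPI_Ssend() was substituted for
         * MPI_Send() to rule out buffering as the cause. */
        MPI_Ssend(buf, NELEM, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

I launch it with something like "mpirun -np 26 -ssi rpi tcp ./myprog"
(c2c) versus "mpirun -np 26 -ssi rpi lamd ./myprog" (the program name
is a placeholder); only the c2c runs hang.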
I'm using a cluster (26 nodes) of Linux 2.4.18-27 machines
with 3 Intel PRO/100 NICs [Intel(R) PRO/100 Network Driver -
version 2.3.27]. These are Red Hat 7.3 installations.
Inspecting the output of tcpdump on the master node (the node
calling the MPI_Recv that hangs) shows that the last worker node
to send to the master continues to send ack packets:
13:16:43.425104 worker.33841 > master.33432: .
27657731:27659179(1448) ack 600305 win 8576 <nop,nop,timestamp
29674004 29671005> (DF)
and I also see:
13:21:56.580532 arp who-has master tell worker
13:21:56.580539 arp reply master is-at 0:42:52:0:6a:3b
If I wait a long time, I see:
icmp: ip reassembly time exceeded [tos 0xc0]
laminfo reports:
LAM/MPI: 7.0
Prefix: /usr/local
Architecture: i686-pc-linux-gnu
Configured by: jgans
Configured on: Sun Sep 7 21:58:00 CDT 2003
Configure host: jojo
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: yes
SSI boot: globus (Module v0.5)
SSI boot: rsh (Module v1.0)
SSI coll: lam_basic (Module v7.0)
SSI coll: smp (Module v1.0)
SSI rpi: crtcp (Module v1.0)
SSI rpi: lamd (Module v7.0)
SSI rpi: sysv (Module v7.0)
SSI rpi: tcp (Module v7.0)
SSI rpi: usysv (Module v7.0)
Any help or suggestions would be greatly appreciated!
Regards,
Jason Gans
Bioscience Division (B-1)
Los Alamos National Lab