LAM/MPI General User's Mailing List Archives

From: Jason D. Gans (jgans_at_[hidden])
Date: 2003-09-10 19:58:27


Hello,

I'm having trouble getting code to run in "c2c" mode.
The code runs intermittently and usually hangs in a
call to MPI_Recv. The network is 100 Mb/s and a lot
of data is being sent to the master node (however, each
worker node waits its turn before sending to the master).
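
For reference, the communication pattern is roughly the following
(a minimal sketch, assuming a rank-0 master and rank-ordered turn
taking; the buffer size, tag, and variable names are made up for
illustration and are not taken from my actual code):

/*
 * Sketch of the master/worker pattern: master receives a large chunk
 * from each worker in turn; workers each send one chunk to rank 0.
 */
#include <stdlib.h>
#include <mpi.h>

#define CHUNK_BYTES (1 << 20)   /* "a lot of data" -- size is a guess */
#define DATA_TAG    1

int main(int argc, char **argv)
{
    int rank, size, src;
    char *buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = (char *) malloc(CHUNK_BYTES);

    if (rank == 0) {
        /* Master: receive from each worker in rank order; this is the
         * MPI_Recv that hangs under the c2c (tcp) RPI but not lamd. */
        for (src = 1; src < size; src++) {
            MPI_Recv(buf, CHUNK_BYTES, MPI_CHAR, src, DATA_TAG,
                     MPI_COMM_WORLD, &status);
        }
    } else {
        /* Worker: send its chunk to the master when its turn comes
         * (turn taking approximated here by the master's receive order). */
        MPI_Send(buf, CHUNK_BYTES, MPI_CHAR, 0, DATA_TAG, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}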

I have attempted to rule out an MPI send/recv bottleneck
by replacing MPI_Send() with MPI_Ssend(). The code still works
fine in "-ssi rpi lamd" mode. Adding MPI_Barrier() calls does
not cause a problem in "lamd" mode. Valgrind did not find any
problems (and Electric Fence ran out of memory).
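
The MPI_Ssend() substitution was just a drop-in change on the worker
side (same illustrative names as in the sketch above):

/* MPI_Ssend() does not complete until the master has posted the
 * matching receive, which rules out unbounded eager/buffered sends
 * as the cause of the hang. */
MPI_Ssend(buf, CHUNK_BYTES, MPI_CHAR, 0, DATA_TAG, MPI_COMM_WORLD);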

I'm using a cluster (26 nodes) of Linux 2.4.18-27 machines
with 3 Intel PRO/100 NICs [Intel(R) PRO/100 Network Driver -
version 2.3.27]. These are Red Hat 7.3 installations.

Inspecting the output of tcpdump on the master node (the node
calling the MPI_Recv that hangs) reveals that the last worker node
to send to the master continues to send ack packets:

13:16:43.425104 worker.33841 > master.33432: .
        27657731:27659179(1448) ack 600305 win 8576 <nop,nop,timestamp
        29674004 29671005> (DF)

and I also see:

13:21:56.580532 arp who-has master tell worker
13:21:56.580539 arp reply master is-at 0:42:52:0:6a:3b

If I wait a long time, I see:

icmp: ip reassembly time exceeded [tos 0xc0]

laminfo reports:

           LAM/MPI: 7.0
            Prefix: /usr/local
      Architecture: i686-pc-linux-gnu
     Configured by: jgans
     Configured on: Sun Sep 7 21:58:00 CDT 2003
    Configure host: jojo
        C bindings: yes
      C++ bindings: yes
  Fortran bindings: yes
       C profiling: yes
     C++ profiling: yes
 Fortran profiling: yes
     ROMIO support: yes
      IMPI support: no
     Debug support: no
      Purify clean: yes
          SSI boot: globus (Module v0.5)
          SSI boot: rsh (Module v1.0)
          SSI coll: lam_basic (Module v7.0)
          SSI coll: smp (Module v1.0)
           SSI rpi: crtcp (Module v1.0)
           SSI rpi: lamd (Module v7.0)
           SSI rpi: sysv (Module v7.0)
           SSI rpi: tcp (Module v7.0)
           SSI rpi: usysv (Module v7.0)

Any help or suggestions would be greatly appreciated!

Regards,

Jason Gans
Bioscience Division (B-1)
Los Alamos National Lab