On Wed, 10 Sep 2003, Jason D. Gans wrote:
> I'm having trouble getting code to run in "c2c" mode.
Minor quibble: with 7.0, we don't make the distinction between c2c and
lamd anymore -- all RPIs are now equal in the eyes of the law^H^H^HMPI
framework. ;-)
> The code runs intermittently and usually gets hung up in a call to
> MPI_Recv. The network is 100 Mbps and a lot of data is being sent to the
> master node (however, each worker node waits its turn before sending to
> the master).
>
> I have attempted to rule out an MPI send/recv bottleneck by replacing
> MPI_Send() with MPI_Ssend(). The code still works fine in "-ssi rpi
> lamd" mode. Adding MPI_Barrier() calls does not cause a problem in
> "lamd" mode. Valgrind did not find any problems (and Electric Fence ran
> out of memory).
A possible cause for this could be a blocking communications pattern
-- something that mapping MPI_Send -> MPI_Ssend might not detect.
The benefit of using the lamd RPI module is that it allows true
communication "in the background" (since progress is made in a
non-blocking fashion in an entirely different process space), whereas
the TCP RPI only "sort of" allows this (progress is only made either a)
when your process is inside the MPI library, or b) by the OS kernel when
it has pending incoming or outgoing data). Additionally, the TCP RPI is
typically "faster" than the lamd RPI (at least in terms of latency and
pure bandwidth, ignoring progression issues), so the timing of your
application may differ between the two RPI modules. Hence, it is still
possible to end up in a deadlock situation under one RPI even though the
other appears to run fine.
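To make that concrete, here's a toy example (hypothetical -- not your
code, just the general shape of the problem) of a blocking pattern that
can behave differently depending on buffering and timing:

    /* Hypothetical sketch (not your code): both ranks post a blocking
     * MPI_Send before either posts its MPI_Recv.  If the implementation
     * happens to buffer the messages, this "works"; if not, neither send
     * returns until the matching receive is posted, and both ranks hang. */
    #include <stdlib.h>
    #include <mpi.h>

    #define COUNT (1 << 20)   /* big enough that buffering is unlikely */

    int main(int argc, char **argv)
    {
        int rank, peer;
        double *sendbuf = malloc(COUNT * sizeof(double));
        double *recvbuf = malloc(COUNT * sizeof(double));
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;    /* run with exactly 2 processes */

        /* Both ranks send first -- a deadlock waiting to happen */
        MPI_Send(sendbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 &status);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }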
A few suggestions:
- ensure that you don't have a serialized or deadlocked communication
  pattern. Normally, I'd suggest attaching a debugger at run-time and
  having a look around, but with 26 or 52 processes, that might get a
  little unwieldy. So I'd resort to a less-than-optimal-yet-at-least-
  somewhat-functional approach: put some printf's in your code -- or,
  better yet, write thin wrappers (with the printfs in them) around the
  MPI communication calls that you use and use the profiling layer to
  call the real MPI functions -- and verify that you're not deadlocking.
  See the first sketch below this list.
- convert to using non-blocking sends and receives. This may actually
  help your application's performance -- if you hand off a bunch of
  communications to MPI and say "go do all of these", MPI can make
  progress on all of them simultaneously, which helps avoid
  unintentionally serialized communication patterns. See the second
  sketch below this list.
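Here's roughly what I mean by a profiling-layer wrapper (a minimal
sketch; what you print and which calls you wrap is up to you):

    /* Hypothetical wrapper: intercept MPI_Send via the profiling layer,
     * log the call, and forward to the real implementation (PMPI_Send).
     * Compile this into your application and every MPI_Send in your
     * code will be traced. */
    #include <stdio.h>
    #include <mpi.h>

    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        int rank, ret;

        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("[rank %d] entering MPI_Send: dest %d, tag %d, count %d\n",
               rank, dest, tag, count);
        fflush(stdout);

        ret = PMPI_Send(buf, count, datatype, dest, tag, comm);

        printf("[rank %d] leaving MPI_Send: dest %d, tag %d\n",
               rank, dest, tag);
        fflush(stdout);
        return ret;
    }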
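And here's a rough sketch of what a non-blocking version of the master
side could look like (the buffer size, tag, and rank layout here are
assumptions on my part, not taken from your code):

    /* Hypothetical sketch of the master side using non-blocking
     * receives: post one MPI_Irecv per worker up front and wait on the
     * whole set.  CHUNK, the tag, and the assumption that workers are
     * ranks 1..N are all made up for illustration. */
    #include <stdlib.h>
    #include <mpi.h>

    #define CHUNK 100000   /* assumed per-worker message size */

    void gather_results(int nworkers, MPI_Comm comm)
    {
        double *bufs = malloc((size_t) nworkers * CHUNK * sizeof(double));
        MPI_Request *reqs = malloc(nworkers * sizeof(MPI_Request));
        MPI_Status *stats = malloc(nworkers * sizeof(MPI_Status));
        int i;

        /* Hand all of the receives to MPI at once; it can then make
         * progress on them in whatever order the data arrives instead
         * of serializing on one worker at a time. */
        for (i = 0; i < nworkers; ++i) {
            MPI_Irecv(bufs + (size_t) i * CHUNK, CHUNK, MPI_DOUBLE,
                      i + 1, 0, comm, &reqs[i]);
        }
        MPI_Waitall(nworkers, reqs, stats);

        /* ... use bufs ... */
        free(bufs);
        free(reqs);
        free(stats);
    }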
> Inspecting the output of tcpdump on the master node (that calls
> the MPI_Recv that hangs) reveals that the last worker node to
> send to the master continues to send ack packets:
>
> 13:16:43.425104 worker.33841 > master.33432: .
> 27657731:27659179(1448) ack 600305 win 8576 <nop,nop,timestamp
> 29674004 29671005> (DF)
>
> and I also see:
>
> 13:21:56.580532 arp who-has master tell worker
> 13:21:56.580539 arp reply master is-at 0:42:52:0:6a:3b
This should be unrelated. LAM does nothing with ARP; that's just normal
address-resolution traffic (the worker refreshing its ARP cache entry for
the master) that happens occasionally on any Ethernet network.
> If I wait a long time, I see
>
> icmp: ip reassembly time exceeded [tos 0xc0]
I *think* that this is unrelated as well; LAM doesn't use ICMP itself.
That message means that a fragmented IP datagram could not be fully
reassembled before the kernel's reassembly timer expired -- i.e., some
fragments were lost or delayed, so the kernel gave up on that datagram.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/