On Mon, 21 Oct 2003, jess michelsen wrote:
> In order to test whether I have the right latency and bandwidth in my
> bi-directional isend/irecv communications (Gigabit), I've put together a
> simple Fortran program, as seen below. For small packet sizes, I get
> exactly the same timings (2*latency) as seen with NetPIPE. For larger
> packets (up to 64 KB), I get almost (95%) the same bandwidth as seen with
> NetPIPE (isn't NetPIPE sending the packets uni-directionally?).
IIRC, NetPIPE's latency and bandwidth measurements are all ping-pong
round-trip times divided by two, so at any given moment its traffic is
effectively uni-directional. I'm not going to swear to this :-), but it
would be consistent with the numbers you're seeing.
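Just to make sure we're talking about the same thing, here's roughly what I
mean by a ping-pong test (a generic sketch off the top of my head, not
NetPIPE's actual code and not your program; the message size and repetition
count are arbitrary, and it assumes your compiler provides the Fortran
"mpi" module -- with older installs you'd include 'mpif.h' instead):

program pingpong
  use mpi
  implicit none
  integer, parameter :: n = 8192        ! message size in doubles (arbitrary)
  integer, parameter :: reps = 1000     ! number of round trips (arbitrary)
  integer :: rank, i, ierr
  integer :: status(MPI_STATUS_SIZE)
  double precision :: buf(n), t0, t1

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  buf = 0.0d0

  t0 = MPI_WTIME()
  do i = 1, reps
     if (rank == 0) then
        ! rank 0 sends, then waits for the echo
        call MPI_SEND(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
        call MPI_RECV(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, status, ierr)
     else if (rank == 1) then
        ! rank 1 receives and bounces the message straight back
        call MPI_RECV(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
        call MPI_SEND(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
     end if
  end do
  t1 = MPI_WTIME()

  ! the one-way figure is half the measured round-trip time
  if (rank == 0) then
     print *, 'one-way time per message (us):', (t1 - t0) / reps / 2.0d0 * 1.0d6
  end if

  call MPI_FINALIZE(ierr)
end program pingpong

Your bi-directional isend/irecv test keeps both directions of the link busy
at once, which is why I wouldn't expect it to match NetPIPE's numbers
exactly.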
> However, once in a while during the test, one of the execution nodes
> 'hangs'. It's even impossible to ssh to the node - so the power button is
> the only means of communication(!)
When you say that you can't ssh to the node, what exactly happens? Does
ssh time out, or give "no route to host"?
> My question is now: could this be a buffer issue (buffered send with a
> really big buffer didn't work better - only slower) -
MPI buffered sends are generally not a good idea; they force the MPI
implementation to make an additional copy of each message into the attached
buffer before sending it.
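To be concrete about where that copy comes from (again, just a generic
sketch of a buffered send, not LAM internals and not your code; the sizes
are arbitrary and assume 8-byte double precision): MPI_BSEND copies the
message into the buffer you attached with MPI_BUFFER_ATTACH and transmits
from there, whereas a nonblocking MPI_ISEND doesn't require that extra
copy.

program bsend_copy
  use mpi
  implicit none
  integer, parameter :: n = 16384       ! message size in doubles (arbitrary)
  double precision :: payload(n)
  character, allocatable :: attach_buf(:)
  integer :: bufsize, rank, ierr
  integer :: status(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  payload = 1.0d0

  ! buffered sends need an explicitly attached buffer; every MPI_BSEND
  ! first copies the outgoing message into it (this is the extra copy)
  bufsize = 8 * n + MPI_BSEND_OVERHEAD   ! assumes 8-byte doubles
  allocate(attach_buf(bufsize))
  call MPI_BUFFER_ATTACH(attach_buf, bufsize, ierr)

  if (rank == 0) then
     ! copies 'payload' into 'attach_buf', then sends from there
     call MPI_BSEND(payload, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_RECV(payload, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
  end if

  ! blocks until all buffered messages have actually been delivered
  call MPI_BUFFER_DETACH(attach_buf, bufsize, ierr)
  call MPI_FINALIZE(ierr)
end program bsend_copy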
> or could there be a hardware flaw - or should I do the communication in
> another fashion?
Do you see the same kind of hangs when you run NetPIPE? Have you run both
the TCP and MPI versions of NetPIPE? For example, if the node hangs even
during the TCP version (which doesn't involve MPI at all), that would be
indicative of a device driver and/or hardware issue.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/