LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-02-05 09:39:49


You are indeed correct: I think we definitely have a problem in the gm
RPI if gm ends up dropping a packet (e.g., due to a timeout). We're
looking into it and hope to have a solution Real Soon Now...

A workaround that may or may not be possible until we get this fixed:
alter your program to check for received messages regularly. For example,
if you're polling for received messages only once in a great while,
increase the polling frequency so that gm doesn't time out and drop the
packet.

As I said, this may or may not be possible within the logic of your code,
but I thought I'd mention it anyway...

On Thu, 5 Feb 2004, Bogdan Costescu wrote:

>
> [ Pressed the wrong key and the message got sent before being finished and
> with some spelling mistakes... ]
>
> On Thu, 5 Feb 2004, Sergei Lisenkov wrote:
>
> > LAM internal GM send: gmID=3 'kappa2' send failed to complete (see kernel log for details): send timed out
>
> That is exactly the error message that I mentioned in a previous e-mail
> about 2 weeks ago, also when running with Myrinet. Jeff Squyres said that
> yet another person has seen the same message and that there might be some
> problem in LAM/MPI.
>
> > LAM internal GM send: gmID=7 'kappa5' send failed to complete (see kernel log for details): send timed out
>
> ... but you get this message from all hosts. I only got it from one host,
> and in all cases that I remember, it was n1 when running on 2 nodes or n2
> when running on 3 or more nodes (and I tried on different nodes to rule
> out hardware problems).
>
> > After lamboot, I run my code:
> > mpirun -np 13 ./test.x input > output &
>
> I usually add "-v" and "-O" (letter o, not zero), which might not be
> needed nowadays, but I got used to it.
>
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/