You are indeed correct -- I think we definitely have a problem in the gm
RPI if gm ends up dropping a packet (e.g., due to a timeout). We're
looking into it and hope to have a solution Real Soon Now...
A workaround that may or may not be possible until we get this fixed:
alter your program to check for received messages regularly. For example,
if you're polling for received messages only once in a great while,
increase the frequency a bit so that gm doesn't time out and drop the
packet.
As I said, this may or may not be possible within the logic of your code,
but I thought I'd mention it anyway...
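To make the idea concrete, here is a minimal sketch in C of a long compute
loop that stops to check for messages every so often. Everything in it is
hypothetical and for illustration only: compute_with_polling(), do_work(),
check_messages() and struct counters are names I made up, not anything from
LAM or gm. It deliberately contains no MPI calls so that it stands alone; in
a real program the body of check_messages() is where a non-blocking check
such as MPI_Iprobe() or MPI_Test() would presumably go.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping so the sketch can be checked without MPI. */
struct counters {
    long work_done;  /* number of work units executed */
    long polls;      /* number of times we checked for messages */
};

/* Stand-in for one unit of application work. */
void do_work(long step, void *ctx)
{
    (void)step;
    ((struct counters *)ctx)->work_done++;
}

/* Stand-in for the message check. Assumption: in an MPI program this
 * would make a non-blocking call (e.g. MPI_Iprobe or MPI_Test) so the
 * library gets a chance to make progress on pending receives. */
int check_messages(void *ctx)
{
    ((struct counters *)ctx)->polls++;
    return 0;  /* 0 = nothing received in this sketch */
}

/* Run total_steps units of work, calling poll() once after every
 * poll_interval units. Returns how many times poll() was called.
 * A smaller poll_interval means more frequent checks. */
long compute_with_polling(long total_steps, long poll_interval,
                          void (*work)(long, void *),
                          int (*poll)(void *), void *ctx)
{
    long polls = 0;

    assert(work != NULL && poll != NULL);
    if (poll_interval < 1)
        poll_interval = 1;  /* clamp: poll after every step at most */

    for (long step = 0; step < total_steps; ++step) {
        work(step, ctx);
        if ((step + 1) % poll_interval == 0) {
            poll(ctx);
            ++polls;
        }
    }
    return polls;
}
```

Calling compute_with_polling(1000, 100, do_work, check_messages, &c) checks
for messages after every 100th work unit; lowering the second argument
raises the polling frequency. The right interval depends on how long one
work unit takes relative to the gm timeout, which I can't guess for your
code.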
On Thu, 5 Feb 2004, Bogdan Costescu wrote:
>
> [ Pressed the wrong key and the message got sent before being finished and
> with some spelling mistakes... ]
>
> On Thu, 5 Feb 2004, Sergei Lisenkov wrote:
>
> > LAM internal GM send: gmID=3 'kappa2' send failed to complete (see kernel log for details): send timed out
>
> That is exactly the error message that I mentioned in a previous e-mail
> about 2 weeks ago, also when running with Myrinet. Jeff Squyres said that
> yet another person has seen the same message and that there might be some
> problem in LAM-MPI.
>
> > LAM internal GM send: gmID=7 'kappa5' send failed to complete (see kernel log for details): send timed out
>
> ... but you get this message from all hosts. I only got it from one host
> and in all cases that I remember, it was n1 when running on 2 nodes or n2
> when running on 3 or more nodes (and I tried on different nodes to rule
> out hardware problems).
>
> > After lamboot, I run my code:
> > mpirun -np 13 ./test.x input > output &
>
> I usually add "-v" and "-O" (letter o, not zero), which might not be
> needed nowadays, but I got used to it.
>
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/