LAM/MPI General User's Mailing List Archives

From: Bogdan Costescu (bogdan.costescu_at_[hidden])
Date: 2004-08-11 09:06:24


On Tue, 10 Aug 2004, Jeff Squyres wrote:

> I can't answer the question about the SGE stuff, but the error you're
> getting is quite odd and may not be related.

Not quite... see below :-)

> > n0<7715> ssi:boot:base:server: expecting connection from finite list
> > n0<7715> ssi:boot:base:server: got connection from 0.0.0.0

This is exactly the kind of error that I used to get while testing the
integration between SGE and LAM, whenever SGE was not allocating enough
slots to run both the qrsh-remote and qrsh-local steps.

> So the question is: why didn't the lamd call back to lamboot?

Because SGE did not execute lamd on the remote node, so there was
nobody to call back ;-)

I bet that these are single-CPU machines, or SMP machines where SGE has
allocated only one slot for the job. This is a limitation of the
current SGE+LAM integration and cannot be overcome from outside SGE,
unless you give up the tight integration and have the qrsh-remote step
executed directly with rsh/ssh instead of qrsh, bypassing SGE.
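
To check whether the slot guess above applies to a given job, something
like the following rough sketch could be run from inside the job script.
It assumes that SGE exports PE_HOSTFILE for parallel jobs and that each
line of that file starts with a host name followed by the number of slots
granted on that host. The function names are made up for illustration,
and the "fewer than 2 slots" threshold is only my reading of the symptom,
not something that SGE or LAM documents.

```python
import os


def parse_pe_hostfile(text):
    """Return (host, slots) pairs from PE_HOSTFILE-style text.

    Assumed line layout: "<host> <slots> <queue> <processor range>".
    Only the first two fields are used.
    """
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        hosts.append((fields[0], int(fields[1])))
    return hosts


def single_slot_hosts(hosts):
    """Hosts that were granted fewer than 2 slots (assumed problem case)."""
    return [host for host, slots in hosts if slots < 2]


if __name__ == "__main__":
    # PE_HOSTFILE is only set inside a parallel SGE job; do nothing otherwise.
    path = os.environ.get("PE_HOSTFILE")
    if path:
        with open(path) as f:
            allocated = parse_pe_hostfile(f.read())
        for host in single_slot_hosts(allocated):
            print("only one slot allocated on", host)
```

If this prints any host, the job is probably in the situation described
above, but take that as a hint rather than a proof.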

-- 
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_[hidden]