On Tue, 10 Aug 2004, Jeff Squyres wrote:
> I can't answer the question about the SGE stuff, but the error you're
> getting is quite odd and may not be related.
Not quite... see below :-)
> > n0<7715> ssi:boot:base:server: expecting connection from finite list
> > n0<7715> ssi:boot:base:server: got connection from 0.0.0.0
This is exactly the kind of error that I used to get when testing the
integration between SGE and LAM when SGE was not allocating enough
slots to do both qrsh-remote and qrsh-local steps.
> So the question is: why didn't the lamd call back to lamboot?
Because SGE did not execute lamd on the remote node, so there was
nobody to call back ;-)
I bet that these are single-CPU machines or SMP machines where SGE has
allocated only one slot for the job. This is a limitation of the
current SGE+LAM integration and cannot be overcome from outside SGE,
unless you give up the tight integration and run the qrsh-remote step
directly with rsh/ssh instead of qrsh, bypassing SGE.
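
One way to check the one-slot guess from inside a job is to look at the
host file SGE writes for the parallel environment ($PE_HOSTFILE). This is
only a rough sketch: it assumes the usual line layout of host name, slot
count, queue, and processor range. The function names, the sample hosts
and queue, and the minimum of two slots per host are illustrative
assumptions, not part of the SGE+LAM integration scripts.

```python
import os


def parse_pe_hostfile(text):
    """Return {hostname: slots} from pe_hostfile-style text.

    Assumes each line looks like: <host> <slots> <queue> <processor range>.
    """
    slots = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, count = fields[0], int(fields[1])
        # A host may appear more than once (one line per queue); add them up.
        slots[host] = slots.get(host, 0) + count
    return slots


def underallocated_hosts(slots, minimum=2):
    """Return the sorted hosts that were granted fewer than `minimum` slots.

    minimum=2 is an assumption: one slot for the qrsh-remote step and one
    for the qrsh-local step.
    """
    return sorted(host for host, n in slots.items() if n < minimum)


# Hypothetical sample data, used when not running inside an SGE job.
SAMPLE = """\
node01 1 all.q@node01 UNDEFINED
node02 2 all.q@node02 UNDEFINED
"""

if __name__ == "__main__":
    path = os.environ.get("PE_HOSTFILE")
    if path:
        with open(path) as f:
            text = f.read()
    else:
        text = SAMPLE
    print(underallocated_hosts(parse_pe_hostfile(text)))
```

Run from the job script, this prints the hosts that got fewer than two
slots; any host listed there is a candidate for the "nobody to call back"
failure described above.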
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_[hidden]