
LAM/MPI General User's Mailing List Archives


From: C.L. Lai [ALAN] (clai33_at_[hidden])
Date: 2004-08-11 09:47:55


On Wed, 11 Aug 2004, Bogdan Costescu wrote:

> On Tue, 10 Aug 2004, Jeff Squyres wrote:
>
> > I can't answer the question about the SGE stuff, but the error you're
> > getting is quite odd and may not be related.
>
> Not quite... see below :-)
>
> > > n0<7715> ssi:boot:base:server: expecting connection from finite list
> > > n0<7715> ssi:boot:base:server: got connection from 0.0.0.0
>
> This is exactly the kind of error that I used to get when testing the
> integration between SGE and LAM when SGE was not allocating enough
> slots to do both qrsh-remote and qrsh-local steps.
>
> > So the question is: why didn't the lamd call back to lamboot?
>
> Because SGE did not execute lamd on the remote node, so there was
> nobody to call back ;-)

Why doesn't SGE execute lamd on the remote nodes?

>
> I bet that these are single CPU machines or SMP machines where SGE has
> allocated only one slot for the job. This is a limitation of the
> current SGE+LAM integration and cannot be overcome from outside SGE,
> unless you give up the tight-integration and make the qrsh-remote step
> be executed directly with rsh/ssh instead of qrsh, bypassing SGE.
>

You lost the bet.
As far as I can tell from SGE, the number of slots equals the number of
processors on each node.
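
One way to double-check the allocation from inside a running job is
sketched below. It assumes that SGE exports the PE_HOSTFILE environment
variable for parallel jobs and that each line of that file starts with
"hostname slots". The function names and the sample host names are
hypothetical, and the single-slot check only illustrates the hypothesis
quoted above; it is not a confirmed diagnosis.

```python
import os


def parse_pe_hostfile(text):
    """Return {hostname: slots} from PE_HOSTFILE-style text.

    Assumes each non-empty line begins with "hostname slots"; any
    further columns (queue, processor range) are ignored.
    """
    slots = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, count = fields[0], int(fields[1])
        slots[host] = slots.get(host, 0) + count
    return slots


def single_slot_hosts(slots):
    """Hosts that were granted exactly one slot (hypothetical check)."""
    return sorted(host for host, n in slots.items() if n == 1)


if __name__ == "__main__":
    # PE_HOSTFILE is only set inside a parallel-environment job.
    path = os.environ.get("PE_HOSTFILE")
    if path is None:
        print("PE_HOSTFILE is not set; run this inside an SGE parallel job")
    else:
        with open(path) as handle:
            allocation = parse_pe_hostfile(handle.read())
        for host, n in sorted(allocation.items()):
            print(host, n)
        print("hosts with a single slot:", single_slot_hosts(allocation))
```

Running this as the first step of the job script would show what SGE
actually granted to this job, as opposed to what the queue configuration
advertises.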

Any resolution?

Thanks,
Alan.

> --
> Bogdan Costescu
>
> IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
> Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
> Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
> E-mail: Bogdan.Costescu_at_[hidden]
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>