LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-08-10 20:07:49


On Aug 10, 2004, at 4:01 PM, C.L. Lai [ALAN] wrote:

> I am trying to get that script working on SGE 6 + LAM 7
> However, I got some errors, I wonder if it's the script's problem or
> it's my setting.

I can't answer the question about the SGE stuff, but the error you're
getting is quite odd and may not be related. Let's investigate that
first, and if that doesn't work out, we'll ping the SGE guys and see
what they say. :-)

> Here is my error
> %cat sgedebug.528.7715
> SGE-LAM DEBUG: LAMHOME = /usr
[snipped]
> n0<7715> ssi:boot:rsh: starting lamd on (jardine2.math.uwo.ca)
> n0<7715> ssi:boot:rsh: starting on n0 (jardine2.math.uwo.ca): hboot -t -c /home/compute/sge/lam/sge-lam-conf.lamd -d -v -sessionsuffix sge-528-0 -I -H 129.100.75.78 -P 36671 -n 0 -o 0
> n0<7715> ssi:boot:rsh: launching locally
> n0<7715> ssi:boot:rsh: successfully launched on n0 (jardine2.math.uwo.ca)
> n0<7715> ssi:boot:base:server: expecting connection from finite list
> n0<7715> ssi:boot:base:server: got connection from 0.0.0.0

What's happening here is that LAM forked off the lamd locally, but then
the lamd didn't call back to lamboot and say "I'm ok!". lamboot
eventually got tired of waiting and gave up.

So the question is: why didn't the lamd call back to lamboot?

The most common reason for this is firewalling software -- LAM uses
random TCP and UDP ports assigned by the OS. Hence, you either need to
disable firewalling software or allow TCP and UDP traffic on random
ports from your trusted set of nodes (including the localhost).
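
One quick way to check the localhost part of this is a small Python
sketch like the one below. It is not part of LAM and does not reproduce
what lamboot or the lamd actually do; it only asks the OS for a random
TCP port and a random UDP port on the loopback address and checks that a
byte can make the round trip on each. The function names and the
127.0.0.1 default are my own choices for illustration.

```python
import socket


def check_tcp_callback(host="127.0.0.1", timeout=5.0):
    """Listen on an OS-assigned TCP port and connect back to it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, 0))  # port 0: let the OS pick a free port
        server.listen(1)
        server.settimeout(timeout)
        port = server.getsockname()[1]
        try:
            with socket.create_connection((host, port), timeout=timeout) as client:
                conn, _ = server.accept()
                with conn:
                    client.sendall(b"x")
                    return conn.recv(1) == b"x"
        except OSError:
            # Covers refused connections and timeouts alike.
            return False


def check_udp_roundtrip(host="127.0.0.1", timeout=5.0):
    """Send one datagram to an OS-assigned UDP port and wait for it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind((host, 0))  # port 0: let the OS pick a free port
        receiver.settimeout(timeout)
        port = receiver.getsockname()[1]
        try:
            sender.sendto(b"x", (host, port))
            data, _ = receiver.recvfrom(16)
            return data == b"x"
        except OSError:
            return False


if __name__ == "__main__":
    print("TCP on a random port:", "ok" if check_tcp_callback() else "BLOCKED")
    print("UDP on a random port:", "ok" if check_udp_roundtrip() else "BLOCKED")
```

If either check reports BLOCKED, local filtering is a likely suspect. A
pass here only covers the loopback address; traffic between nodes would
need the listening and connecting halves run on different machines.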

You might want to look in the syslogs -- "lamboot -d" causes the lamd
to output some information to the syslogs; there may be information in
there about why the lamd died before connecting back to lamboot. Also
look for a corefile indicating that the lamd aborted improperly.

Let me know what you find.

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/