Hi,
I am trying to get that script working on SGE 6 + LAM 7.
However, I got some errors, and I wonder whether the problem is in the
script or in my settings.
Here is the error output:
%cat sgedebug.528.7715
SGE-LAM DEBUG: LAMHOME = /usr
SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
SGE-LAM DEBUG: PATH = /tmp/528.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin
SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
SGE-LAM DEBUG: ARGV = ""
SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
SGE-LAM DEBUG: func=start
SGE-LAM DEBUG: LAMBOOT ARGS: -nn -ssi boot rsh -ssi boot_rsh_agent
/home/compute/sge/lam/sge-lam qrsh-remote -c
/home/compute/sge/lam/sge-lam-conf.lamd -v -d /tmp/528.1.all.q/lamhostfile
/tmp/528.1.all.q/lamhostfile
n0<7715> ssi:boot: Opening
n0<7715> ssi:boot: looking for module named rsh
n0<7715> ssi:boot: opening module rsh
n0<7715> ssi:boot: initializing module rsh
n0<7715> ssi:boot:rsh: module initializing
n0<7715> ssi:boot:rsh:agent: /home/compute/sge/lam/sge-lam qrsh-remote
n0<7715> ssi:boot:rsh:username: <same>
n0<7715> ssi:boot:rsh:verbose: 1000
n0<7715> ssi:boot:rsh:algorithm: linear
n0<7715> ssi:boot:rsh:priority: 10
n0<7715> ssi:boot: Selected boot module rsh
n0<7715> ssi:boot:base: looking for boot schema in following directories:
n0<7715> ssi:boot:base: <current directory>
n0<7715> ssi:boot:base: $TROLLIUSHOME/etc
n0<7715> ssi:boot:base: $LAMHOME/etc
n0<7715> ssi:boot:base: /etc/lam
n0<7715> ssi:boot:base: looking for boot schema file:
n0<7715> ssi:boot:base: /tmp/528.1.all.q/lamhostfile
n0<7715> ssi:boot:base: found boot schema: /tmp/528.1.all.q/lamhostfile
n0<7715> ssi:boot:rsh: found the following hosts:
n0<7715> ssi:boot:rsh: n0 jardine2.math.uwo.ca (cpu=1)
n0<7715> ssi:boot:rsh: n1 temp.math.uwo.ca (cpu=1)
n0<7715> ssi:boot:rsh: resolved hosts:
n0<7715> ssi:boot:rsh: n0 jardine2.math.uwo.ca --> 129.100.75.78 (origin)
n0<7715> ssi:boot:rsh: n1 temp.math.uwo.ca --> 129.100.75.99
n0<7715> ssi:boot:rsh: starting RTE procs
n0<7715> ssi:boot:base:linear: starting
n0<7715> ssi:boot:base:server: opening server TCP socket
n0<7715> ssi:boot:base:server: opened port 36671
n0<7715> ssi:boot:base:linear: booting n0 (jardine2.math.uwo.ca)
n0<7715> ssi:boot:rsh: starting lamd on (jardine2.math.uwo.ca)
n0<7715> ssi:boot:rsh: starting on n0 (jardine2.math.uwo.ca): hboot -t -c /home/compute/sge/lam/sge-lam-conf.lamd -d -v -sessionsuffix sge-528-0 -I -H 129.100.75.78 -P 36671 -n 0 -o 0
n0<7715> ssi:boot:rsh: launching locally
n0<7715> ssi:boot:rsh: successfully launched on n0 (jardine2.math.uwo.ca)
n0<7715> ssi:boot:base:server: expecting connection from finite list
n0<7715> ssi:boot:base:server: got connection from 0.0.0.0
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.
As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:
- There are network filters between the lamboot agent host and
the remote host such that communication on random TCP ports
is blocked
- Network routing from the remote host to the local host isn't
properly configured (this is uncommon)
You can check these things by watching the output from "lamboot -d".
1. On the command line for hboot, there are two important parameters:
one is the IP address of where the lamboot agent was invoked, the
other is the port number that the lamboot agent is expecting the
newly-booted process to call back on (this will be a random
integer).
2. Manually login to the remote machine and try to telnet to the port
indicated on the hboot command line. For example,
telnet <ipnumber> <portnumber>
If all goes well, you should get a "Connection refused" error. If
you get any other kind of error, it could indicate either of the
two conditions above. Consult with your system/network
administrator.
-----------------------------------------------------------------------------
n0<7715> ssi:boot:base:server: failed to connect to remote lamd!
n0<7715> ssi:boot:base:server: closing server socket
n0<7715> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
lamboot did NOT complete successfully
I made some changes to the script, but only to a few variables, so they
should be irrelevant to this error.
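For reference, the telnet check that the lamboot message describes could
also be scripted. Below is a minimal Python sketch, not part of sge-lam or
LAM. The function name probe_callback_port and the result labels are
hypothetical, made up for illustration. It only tries one TCP connection
and reports how the attempt ended.

```python
import socket


def probe_callback_port(host, port, timeout=5.0):
    """Try one TCP connection to host:port and report how it ended.

    Hypothetical helper: the name and the returned labels are
    illustrative and do not come from LAM or SGE.
    """
    try:
        # create_connection resolves the host and performs the TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"   # something is listening on that port
    except ConnectionRefusedError:
        return "refused"         # host reachable, nothing listening
    except socket.timeout:
        return "timeout"         # no answer within the timeout
    except OSError as exc:
        return "error: %s" % exc  # e.g. no route to host
```

For the run above, the call would be probe_callback_port("129.100.75.78",
36671), taken from the -H and -P values on the hboot line and run on
temp.math.uwo.ca. Going by the lamboot message, "refused" is the expected
result once lamboot has exited, while "timeout" or another error would
point toward filtering or routing between the two hosts.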
Thanks,
Alan.
On Thu, 5 Aug 2004, Jeff Squyres wrote:
> Try the script included in this post:
>
> http://www.lam-mpi.org/MailArchives/lam/msg07480.php
>
> I don't see this script posted on the SGE site; I'll ping the SGE
> developers and see if they can post it there.
>
> > Hi,
> >
> > I have read a mention of an sge-lam script for SGE 6 + LAM 7 that has
> > not yet been published.
> > So where can I obtain that not-yet-published sge-lam script?
> > Is that script all I need to get SGE 6 + LAM 7 working properly?
> >
> > Thank you,
> > Alan.
> >
> >
> > On Tue, 3 Aug 2004, Jeff Squyres wrote:
> >
> > >> LAM 7.0.6 uses the appropriate environment variables that SGE provides
> > >> to create job-private universes. After extensive discussions with the
> > >> SGE developers, this was decided to be the best model of SGE+LAM
> > >> integration -- make LAM "aware" of SGE, but leave the integration parts
> > >> outside of LAM and in external scripts. Users have posted helpful
> > >> SGE/LAM scripts to this list before; search the archives and you should
> > >> find them. Also, I seem to recall that the official SGE web site has a
> > >> script and/or instructions on how to integrate SGE and LAM.
> >>
> >> Hope that helps.
> >>
> >>
> >> On Aug 2, 2004, at 7:40 PM, C.L. Lai [ALAN] wrote:
> >>
> >>> Somebody might have asked about this before.
> > >>> I just want to make sure: is SGE 6 + LAM 7 supported, and has anyone
> > >>> got it working flawlessly?
> >>
> >> --
> >> {+} Jeff Squyres
> >> {+} jsquyres_at_[hidden]
> >> {+} http://www.lam-mpi.org/
> >>
> >> _______________________________________________
> >> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >>
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/