
LAM/MPI General User's Mailing List Archives


From: C.L. Lai [ALAN] (clai33_at_[hidden])
Date: 2004-08-11 17:29:05


On Wed, 11 Aug 2004, Bogdan Costescu wrote:

> On Wed, 11 Aug 2004, C.L. Lai [ALAN] wrote:
>
> > The scheduler tends to take as many nodes as possible
>
> If this is not what you want, try setting allocation_rule to $fill_up
> in the PE definition (see 'man sge_pe'), which will allocate as many
> slots as possible from a node before going to another node.
>
> > In such case, the number of allocated slots for the job is only 1.
> > So did you say the problem comes from here?
>
> Yes.
>
> > I have also tried specifying more required-processors than the number of
> > nodes, so that some nodes will have more than 1 slot allocated, but the
> > result is the same.
>
> lamd needs to be booted on all nodes, so you have to have _all_ nodes
> with more than 1 slot allocated, not only some of them.
>
> > Is it like the startup of lamd is treated as part of the job and
> > requires an extra slot for this startup?
>
> Yes.
>
> > So that the required-slots is allocated-slots + 1 ?
>
> Required slots is always 2. One for the qrsh-remote step, one for the
> qrsh-local step. After lamd is started by the qrsh-local step, SGE is
> not involved anymore, all processes are children of lamd. LAM doesn't
> care about the slots allocated by SGE, but due to the fact that the
> boot schema is created based on the information from SGE, LAM will
> only start as many processes on a node as SGE has allocated slots there.
> However, because lamd was started by qrsh, SGE still has control over
> how much time the whole LAM ensemble (lamd + children) can run and can
> also send signals (for example, to achieve qdel on a running job).
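
To make the "boot schema is created based on the information from SGE"
part concrete, here is a minimal sketch of how such a conversion might
look. It assumes the information comes from the file named by
$PE_HOSTFILE, with one line per host in the form "host slots queue
processor-range" as described in 'man sge_pe'. The function name and the
sample queue field are made up for illustration; this is not the code of
the sge-lam script itself.

```python
def pe_hostfile_to_boot_schema(pe_hostfile_text):
    """Turn SGE PE host file lines into LAM boot schema lines (host cpu=N)."""
    slots_per_host = {}
    for line in pe_hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        # A host may appear once per queue, so add the slot counts up.
        slots_per_host[host] = slots_per_host.get(host, 0) + slots
    return [f"{host} cpu={count}" for host, count in slots_per_host.items()]


# Hypothetical host file content for a 2-slot job on a single SMP node.
example = "rational.math.uwo.ca 2 all.q@rational.math.uwo.ca UNDEFINED\n"
print("\n".join(pe_hostfile_to_boot_schema(example)))
# prints: rational.math.uwo.ca cpu=2
```

The printed line has the same shape as the LAMHOSTSLIST entry in the
debug output.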

I tried submitting a job to an SMP machine with 2 slots, but it still
fails with the same error.

Here is the error:

SGE-LAM DEBUG: LAMHOME = /usr
SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
SGE-LAM DEBUG: PATH =
/tmp/537.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin
SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
SGE-LAM DEBUG: ARGV = ""
SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
SGE-LAM DEBUG: func=start
SGE-LAM DEBUG: LAMBOOT ARGS: -nn -ssi boot rsh -ssi boot_rsh_agent
/home/compute/sge/lam/sge-lam qrsh-remote -c
/home/compute/sge/lam/sge-lam-conf.lamd -v -d /tmp/537.1.all.q/lamhostfile
/tmp/537.1.all.q/lamhostfile
SGE-LAM DEBUG: LAMHOSTSLIST: rational.math.uwo.ca cpu=2
n0<24778> ssi:boot: Opening
n0<24778> ssi:boot: looking for module named rsh
n0<24778> ssi:boot: opening module rsh
n0<24778> ssi:boot: initializing module rsh
n0<24778> ssi:boot:rsh: module initializing
n0<24778> ssi:boot:rsh:agent: /home/compute/sge/lam/sge-lam qrsh-remote
n0<24778> ssi:boot:rsh:username: <same>
n0<24778> ssi:boot:rsh:verbose: 1000
n0<24778> ssi:boot:rsh:algorithm: linear
n0<24778> ssi:boot:rsh:priority: 10
n0<24778> ssi:boot: Selected boot module rsh
n0<24778> ssi:boot:base: looking for boot schema in following directories:
n0<24778> ssi:boot:base: <current directory>
n0<24778> ssi:boot:base: $TROLLIUSHOME/etc
n0<24778> ssi:boot:base: $LAMHOME/etc
n0<24778> ssi:boot:base: /etc/lam
n0<24778> ssi:boot:base: looking for boot schema file:
n0<24778> ssi:boot:base: /tmp/537.1.all.q/lamhostfile
n0<24778> ssi:boot:base: found boot schema: /tmp/537.1.all.q/lamhostfile
n0<24778> ssi:boot:rsh: found the following hosts:
n0<24778> ssi:boot:rsh: n0 rational.math.uwo.ca (cpu=2)
n0<24778> ssi:boot:rsh: resolved hosts:
n0<24778> ssi:boot:rsh: n0 rational.math.uwo.ca --> 129.100.75.80
(origin)
n0<24778> ssi:boot:rsh: starting RTE procs
n0<24778> ssi:boot:base:linear: starting
n0<24778> ssi:boot:base:server: opening server TCP socket
n0<24778> ssi:boot:base:server: opened port 35804
n0<24778> ssi:boot:base:linear: booting n0 (rational.math.uwo.ca)
n0<24778> ssi:boot:rsh: starting lamd on (rational.math.uwo.ca)
n0<24778> ssi:boot:rsh: starting on n0 (rational.math.uwo.ca): hboot -t -c
/home/compute/sge/lam/sge-lam-conf.lamd -d -v -sessionsuffix sge-537-0 -I
-H 129.100.75.80 -P 35804 -n 0 -o 0
n0<24778> ssi:boot:rsh: launching locally
n0<24778> ssi:boot:rsh: successfully launched on n0 (rational.math.uwo.ca)
n0<24778> ssi:boot:base:server: expecting connection from finite list
n0<24778> ssi:boot:base:server: got connection from 0.0.0.0
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
n0<24778> ssi:boot:base:server: failed to connect to remote lamd!
n0<24778> ssi:boot:base:server: closing server socket
n0<24778> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
lamboot did NOT complete successfully
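
The two checks suggested in the message above (find the address and port
on the hboot command line, then try to connect to them) can be scripted
roughly as follows. This is only a sketch: the function names are
invented here, and the probe uses a plain TCP connect in place of telnet.
Going by the lamboot text, a "refused" result after lamboot has exited
would mean routing is fine, while a timeout would point at filtering.

```python
import re
import socket


def callback_endpoint(hboot_command):
    """Extract the -H address and -P port from an hboot/lamd command line."""
    host = re.search(r"-H\s+(\S+)", hboot_command)
    port = re.search(r"-P\s+(\d+)", hboot_command)
    if not host or not port:
        raise ValueError("no -H/-P pair found on the command line")
    return host.group(1), int(port.group(1))


def probe(host, port, timeout=5.0):
    """Try a TCP connect and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Command line copied from the debug output above.
cmd = ("hboot -t -c /home/compute/sge/lam/sge-lam-conf.lamd -d -v "
       "-sessionsuffix sge-537-0 -I -H 129.100.75.80 -P 35804 -n 0 -o 0")
print(callback_endpoint(cmd))
# To run the actual check from the remote node: probe(*callback_endpoint(cmd))
```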

And here is the output from the qrsh-local step:

SGE-LAM DEBUG: LAMHOME = /usr
SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
SGE-LAM DEBUG: PATH =
/tmp/537.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin:/home/compute/sge/bin/lx26-amd64:/usr/bin
SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
SGE-LAM DEBUG: ARGV =
"/usr/bin/lamd" "-H" "129.100.75.80" "-P" "35804" "-n" "0" "-o" "0" "-d" "-sessionsuffix" "sge-537-0"
SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
SGE-LAM DEBUG: func=qrsh-local
SGE-LAM DEBUG: QRSH LOCAL CONFIG: -inherit -nostdin -V
rational.math.uwo.ca /usr/bin/lamd -H 129.100.75.80 -P 35804 -n 0 -o 0 -d
-sessionsuffix sge-537-0
SGE-LAM DEBUG: Exec qrsh-local: /home/compute/sge/bin/lx26-amd64/qrsh
-inherit -nostdin -V rational.math.uwo.ca /usr/bin/lamd -H 129.100.75.80
-P 35804 -n 0 -o 0 -d -sessionsuffix sge-537-0
rcmd: socket: Permission denied

I submitted the job with 'qsub -pe lam 2 stuff.sh' while all the SGE
nodes except one were suspended.

Here is the output of 'qconf -sp lam':
pe_name lam
slots 100
user_lists NONE
xuser_lists NONE
start_proc_args /home/compute/sge/lam/sge-lam start
stop_proc_args /home/compute/sge/lam/sge-lam stop
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min

Thanks,
Alan.

>
> --
> Bogdan Costescu
>
> IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
> Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
> Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
> E-mail: Bogdan.Costescu_at_[hidden]
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>