LAM/MPI General User's Mailing List Archives

From: j_reichel_at_[hidden]
Date: 2006-03-28 09:05:13


Hello,

I am trying to integrate LAM with SGE 6.0, but it does not work correctly.
I have a startlam script, and I added a new Parallel Environment to SGE.
But after submitting the job there is no result.
I think there is a problem with the lamboot command.
I started it with the -d option to see what happens.
In the logfile I can see that the lamd daemon is started on all the nodes of the cluster.

But the last part of the logfile says that there is no lamd on the head node.

Do you have any idea?

Here are the logfile, the startlam script, and the SGE PE:

logfile:

n-1<10054> ssi:boot:open: opening
n-1<10054> ssi:boot:open: looking for boot module named rsh
n-1<10054> ssi:boot:open: opening boot module rsh
n-1<10054> ssi:boot:open: opened boot module rsh
n-1<10054> ssi:boot:select: initializing boot module rsh
n-1<10054> ssi:boot:rsh: module initializing
n-1<10054> ssi:boot:rsh:agent: ssh -x
n-1<10054> ssi:boot:rsh:username: <same>
n-1<10054> ssi:boot:rsh:verbose: 1000
n-1<10054> ssi:boot:rsh:algorithm: linear
n-1<10054> ssi:boot:rsh:no_n: 0
n-1<10054> ssi:boot:rsh:no_profile: 0
n-1<10054> ssi:boot:rsh:fast: 0
n-1<10054> ssi:boot:rsh:ignore_stderr: 0
n-1<10054> ssi:boot:rsh:priority: 10
n-1<10054> ssi:boot:select: boot module available: rsh, priority: 10
n-1<10054> ssi:boot:select: selected boot module rsh
n-1<10054> ssi:boot:base: looking for boot schema in following directories:
n-1<10054> ssi:boot:base:   <current directory>
n-1<10054> ssi:boot:base:   $TROLLIUSHOME/etc
n-1<10054> ssi:boot:base:   $LAMHOME/etc
n-1<10054> ssi:boot:base:   /usr/lib/lam/etc
n-1<10054> ssi:boot:base: looking for boot schema file:
n-1<10054> ssi:boot:base:   /tmp/78.1.all.q/machines
n-1<10054> ssi:boot:base: found boot schema: /tmp/78.1.all.q/machines
n-1<10054> ssi:boot:rsh: found the following hosts:
n-1<10054> ssi:boot:rsh:   n0 ppc207 (cpu=1)
n-1<10054> ssi:boot:rsh:   n1 ppc211 (cpu=1)
n-1<10054> ssi:boot:rsh:   n2 ppc203 (cpu=1)
n-1<10054> ssi:boot:rsh:   n3 ppc205 (cpu=1)
n-1<10054> ssi:boot:rsh:   n4 ppc228 (cpu=1)
n-1<10054> ssi:boot:rsh:   n5 ppc208 (cpu=1)
n-1<10054> ssi:boot:rsh:   n6 ppc206 (cpu=1)
n-1<10054> ssi:boot:rsh:   n7 ppc229 (cpu=1)
n-1<10054> ssi:boot:rsh:   n8 ppc231 (cpu=1)
n-1<10054> ssi:boot:rsh: resolved hosts:
n-1<10054> ssi:boot:rsh:   n0 ppc207 --> 141.35.13.107 (origin)
n-1<10054> ssi:boot:rsh:   n1 ppc211 --> 141.35.13.111
n-1<10054> ssi:boot:rsh:   n2 ppc203 --> 141.35.13.103
n-1<10054> ssi:boot:rsh:   n3 ppc205 --> 141.35.13.105
n-1<10054> ssi:boot:rsh:   n4 ppc228 --> 141.35.13.119
n-1<10054> ssi:boot:rsh:   n5 ppc208 --> 141.35.13.108
n-1<10054> ssi:boot:rsh:   n6 ppc206 --> 141.35.13.106
n-1<10054> ssi:boot:rsh:   n7 ppc229 --> 141.35.13.120
n-1<10054> ssi:boot:rsh:   n8 ppc231 --> 141.35.13.122
n-1<10054> ssi:boot:rsh: starting RTE procs
n-1<10054> ssi:boot:base:linear: starting
n-1<10054> ssi:boot:base:server: opening server TCP socket
n-1<10054> ssi:boot:base:server: opened port 32789
n-1<10054> ssi:boot:base:linear: booting n0 (ppc207)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc207)
n-1<10054> ssi:boot:rsh: starting on n0 (ppc207): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -I -H 141.35.13.107 -P 32789 -n 0 -o 0
n-1<10054> ssi:boot:rsh: launching locally
n-1<10057> ssi:boot:open: opening
n-1<10057> ssi:boot:open: looking for boot module named rsh
n-1<10057> ssi:boot:open: opening boot module rsh
n-1<10057> ssi:boot:open: opened boot module rsh
n-1<10057> ssi:boot:select: initializing boot module rsh
n-1<10057> ssi:boot:rsh: module initializing
n-1<10057> ssi:boot:rsh:agent: ssh -x
n-1<10057> ssi:boot:rsh:username: <same>
n-1<10057> ssi:boot:rsh:verbose: 1000
n-1<10057> ssi:boot:rsh:algorithm: linear
n-1<10057> ssi:boot:rsh:no_n: 0
n-1<10057> ssi:boot:rsh:no_profile: 0
n-1<10057> ssi:boot:rsh:fast: 0
n-1<10057> ssi:boot:rsh:ignore_stderr: 0
n-1<10057> ssi:boot:rsh:priority: 10
n-1<10057> ssi:boot:select: boot module available: rsh, priority: 10
n-1<10057> ssi:boot:select: selected boot module rsh
n-1<10057> ssi:boot:send_lamd: getting node ID from command line
n-1<10057> ssi:boot:send_lamd: getting agent haddr from command line
n-1<10057> ssi:boot:send_lamd: getting agent port from command line
n-1<10057> ssi:boot:send_lamd: getting node ID from command line
n-1<10057> ssi:boot:send_lamd: connecting to 141.35.13.107:32789, node id 0
n-1<10057> ssi:boot:send_lamd: sending dli_port 32811
n-1<10054> ssi:boot:rsh: successfully launched on n0 (ppc207)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.107
n-1<10054> ssi:boot:base:server: this connection is expected (n0)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.107:32811
n-1<10054> ssi:boot:base:linear: booting n1 (ppc211)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc211)
n-1<10054> ssi:boot:rsh: starting on n1 (ppc211): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 1 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc211 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc211 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 1 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n1 (ppc211)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.111
n-1<10054> ssi:boot:base:server: this connection is expected (n1)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.111:32803
n-1<10054> ssi:boot:base:linear: booting n2 (ppc203)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc203)
n-1<10054> ssi:boot:rsh: starting on n2 (ppc203): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 2 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc203 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc203 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 2 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n2 (ppc203)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.103
n-1<10054> ssi:boot:base:server: this connection is expected (n2)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.103:32840
n-1<10054> ssi:boot:base:linear: booting n3 (ppc205)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc205)
n-1<10054> ssi:boot:rsh: starting on n3 (ppc205): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 3 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc205 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc205 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 3 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n3 (ppc205)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.105
n-1<10054> ssi:boot:base:server: this connection is expected (n3)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.105:32812
n-1<10054> ssi:boot:base:linear: booting n4 (ppc228)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc228)
n-1<10054> ssi:boot:rsh: starting on n4 (ppc228): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 4 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc228 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc228 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 4 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n4 (ppc228)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.119
n-1<10054> ssi:boot:base:server: this connection is expected (n4)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.119:32806
n-1<10054> ssi:boot:base:linear: booting n5 (ppc208)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc208)
n-1<10054> ssi:boot:rsh: starting on n5 (ppc208): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 5 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc208 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc208 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 5 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n5 (ppc208)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.108
n-1<10054> ssi:boot:base:server: this connection is expected (n5)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.108:32821
n-1<10054> ssi:boot:base:linear: booting n6 (ppc206)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc206)
n-1<10054> ssi:boot:rsh: starting on n6 (ppc206): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 6 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc206 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc206 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 6 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n6 (ppc206)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.106
n-1<10054> ssi:boot:base:server: this connection is expected (n6)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.106:32807
n-1<10054> ssi:boot:base:linear: booting n7 (ppc229)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc229)
n-1<10054> ssi:boot:rsh: starting on n7 (ppc229): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 7 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc229 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc229 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 7 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n7 (ppc229)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.120
n-1<10054> ssi:boot:base:server: this connection is expected (n7)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.120:32798
n-1<10054> ssi:boot:base:linear: booting n8 (ppc231)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc231)
n-1<10054> ssi:boot:rsh: starting on n8 (ppc231): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 8 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc231 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc231 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 8 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n8 (ppc231)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.122
n-1<10054> ssi:boot:base:server: this connection is expected (n8)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.122:34876
n-1<10054> ssi:boot:base:server: closing server socket
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.107:32790
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.107:32790
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.111:32784
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.111:32784
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.103:32795
n-1<10057> ssi:boot:rsh: finalizing
n-1<10057> ssi:boot: Closing
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.103:32795
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.105:32792
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.105:32792
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.119:32792
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.119:32792
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.108:32793
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.108:32793
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.106:32788
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.106:32788
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.120:54488
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.120:54488
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.122:56713
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.122:56713
n-1<10054> ssi:boot:base:linear: finished
n-1<10054> ssi:boot:rsh: all RTE procs started
n-1<10054> ssi:boot:rsh: finalizing
n-1<10054> ssi:boot: Closing
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host ppc207.

This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "lamhalt" command.

Please run the "lamboot" command the start the LAM/MPI runtime
environment.  See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
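
The observation that lamd was launched on every node can be checked mechanically rather than by reading the log by eye. The following is only a minimal sketch: it counts the "resolved hosts" entries and the "successfully launched on" entries and compares them. The four sample lines are copied from the log above; the temporary file stands in for wherever the real lamboot output was saved, which is an assumption.

```shell
#!/bin/sh
# Sketch: compare the hosts lamboot resolved with the hosts it reports as launched.
# The sample lines come from the log above; in practice "$log" would point at
# the saved lamboot -d output (its location is an assumption).
log=$(mktemp)
cat > "$log" <<'EOF'
n-1<10054> ssi:boot:rsh:   n0 ppc207 --> 141.35.13.107 (origin)
n-1<10054> ssi:boot:rsh:   n1 ppc211 --> 141.35.13.111
n-1<10054> ssi:boot:rsh: successfully launched on n0 (ppc207)
n-1<10054> ssi:boot:rsh: successfully launched on n1 (ppc211)
EOF

# One " --> " line per resolved host, one "successfully launched" line per boot.
resolved=$(grep -c -e ' --> ' "$log")
launched=$(grep -c -e 'successfully launched on' "$log")

echo "resolved=$resolved launched=$launched"
if [ "$resolved" -eq "$launched" ]; then
   echo "lamboot reports a launch on every resolved host"
else
   echo "some hosts were resolved but never launched"
fi
rm -f "$log"
```

Matching counts only show that lamboot reported a launch on each host. They do not show that the daemon is still alive when lamhalt runs later, which is what the message above complains about.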


startlam script:

#!/bin/sh
#
#
# (c) 2002 Sun Microsystems, Inc. Use is subject to license terms. 

#
# preparation of the mpi machine file
#
# usage: startlam.sh [options] <pe_hostfile>
#
#        options are:
#                     -catch_hostname
#                      force use of hostname wrapper in $TMPDIR when starting mpirun  
#                     -catch_rsh
#                      force use of rsh wrapper in $TMPDIR when starting mpirun  
#                     -unique
#                      generate a machinefile where each hostname appears only once
#                      This is needed to setup a multithreaded mpi application
#

PeHostfile2MachineFile()
{
   cat $1 | while read line; do
      # echo $line
      host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
      nslots=`echo $line|cut -f2 -d" "`
      i=1
      while [ $i -le $nslots ]; do
         # add here code to map regular hostnames into ATM hostnames
         echo $host
         i=`expr $i + 1`
      done
   done
}


#
# startup of LAM conforming with the Grid Engine
# Parallel Environment interface
#
# on success the job will find a machine-file in $TMPDIR/machines
#

# useful to control parameters passed to us 
echo $*

# parse options
catch_rsh=0
catch_hostname=0
unique=0
while [ "$1" != "" ]; do
   case "$1" in
      -catch_rsh)
         catch_rsh=1
         ;;
      -catch_hostname)
         catch_hostname=1
         ;;
      -unique)
         unique=1
         ;;
      *)
         break;
         ;;
   esac
   shift
done

me=`basename $0`

# test number of args
if [ $# -ne 1 ]; then
   echo "$me: got wrong number of arguments" >&2
   exit 1
fi

# get arguments
pe_hostfile=$1

# ensure pe_hostfile is readable
if [ ! -r $pe_hostfile ]; then
   echo "$me: can't read $pe_hostfile" >&2
   exit 1
fi

# create machine-file
# remove column with number of slots per queue
# mpi does not support them in this form
machines="$TMPDIR/machines"

if [ $unique = 1 ]; then
   PeHostfile2MachineFile $pe_hostfile | uniq >> $machines
else
   PeHostfile2MachineFile $pe_hostfile >> $machines
fi

# trace machines file
cat $machines

#
# Make script wrapper for 'rsh' available in jobs tmp dir
#
if [ $catch_rsh = 1 ]; then
   rsh_wrapper=$SGE_ROOT/lam_loose_rsh/rsh
   if [ ! -x $rsh_wrapper ]; then
      echo "$me: can't execute $rsh_wrapper" >&2
      echo "     maybe it resides at a file system not available at this machine" >&2
      exit 1
   fi

   rshcmd=rsh
   case "$ARC" in
      hp|hp10|hp11|hp11-64) rshcmd=remsh ;;
      *) ;;
   esac
   # note: This could also be done using rcp, ftp or s.th.
   #       else. We use a symbolic link since it is the
   #       cheapest in case of a shared filesystem
   #
   ln -s $rsh_wrapper $TMPDIR/$rshcmd
fi

#
# Make script wrapper for 'hostname' available in jobs tmp dir
#
if [ $catch_hostname = 1 ]; then
   hostname_wrapper=$SGE_ROOT/lam_loose_rsh/hostname
   if [ ! -x $hostname_wrapper ]; then
      echo "$me: can't execute $hostname_wrapper" >&2
      echo "     maybe it resides at a file system not available at this machine" >&2
      exit 1
   fi

   # note: This could also be done using rcp, ftp or s.th.
   #       else. We use a symbolic link since it is the
   #       cheapest in case of a shared filesystem
   #
   ln -s $hostname_wrapper $TMPDIR/hostname
fi

#
# Extra LAM statement(s)
#
#if [ -z "`which lamboot 2>/dev/null`" ] ; then
#    export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
#fi
#lamboot -d -ssi boot rsh -ssi rsh_agent "ssh -x" $machines
lamboot -b -d -ssi boot rsh -ssi boot_rsh_agent "ssh -x" $machines
echo "lamboot finished"
# signal success to caller
exit 0
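
To make the machine-file step of the script above concrete, here is a minimal sketch of what PeHostfile2MachineFile produces. The pe_hostfile contents are hypothetical: the `example.org` domain, the slot counts, and the queue and processor-range columns are assumptions for illustration, and only the host names ppc207 and ppc211 come from the log. The function body follows the script, except that it reads the file by redirection instead of `cat`.

```shell
#!/bin/sh
# Same conversion as PeHostfile2MachineFile in the script above:
# strip the domain from column 1 and print the host once per slot (column 2).
PeHostfile2MachineFile()
{
   while read -r line; do
      host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
      nslots=`echo $line|cut -f2 -d" "`
      i=1
      while [ $i -le $nslots ]; do
         echo $host
         i=`expr $i + 1`
      done
   done < "$1"
}

# Hypothetical pe_hostfile (assumed columns: host, slots, queue, processor range).
pe_hostfile=$(mktemp)
cat > "$pe_hostfile" <<'EOF'
ppc207.example.org 2 all.q@ppc207.example.org UNDEFINED
ppc211.example.org 1 all.q@ppc211.example.org UNDEFINED
EOF

# Without -unique: one line per slot. With -unique: one line per host.
all=$(PeHostfile2MachineFile "$pe_hostfile")
unique=$(PeHostfile2MachineFile "$pe_hostfile" | uniq)

echo "$all"
echo "---"
echo "$unique"
rm -f "$pe_hostfile"
```

With `-unique`, as passed in start_proc_args below, each host ends up in the machine file once. That is consistent with the `cpu=1` entries lamboot prints for every host in the log.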




PE in SGE:

pe_name           lam7
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/local/grid/sge6.0/mpi/lam_loose_ssh/startlam.sh -unique $pe_hostfile
stop_proc_args    /usr/local/grid/sge6.0/mpi/lam_loose_ssh/stoplam.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

Regards

Joerg

hboot -t -c lam -conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 6 -o 0"' n-1<10054> ssi:boot:rsh: successfully launched on n6 (ppc206) n-1<10054> ssi:boot:base:server: expecting connection from finite list n-1<10054> ssi:boot:base:server: got connection from 141.35.13.106 n-1<10054> ssi:boot:base:server: this connection is expected (n6) n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.106:32807 n-1<10054> ssi:boot:base:linear: booting n7 (ppc229) n-1<10054> ssi:boot:rsh: starting lamd on (ppc229) n-1<10054> ssi:boot:rsh: starting on n7 (ppc229): hboot -t -c lam-conf.lamd -d - sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 7 -o 0" n-1<10054> ssi:boot:rsh: launching remotely n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc229 -n 'echo $SHELL' n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc229 -n hboot -t -c lam -conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 7 -o 0"' n-1<10054> ssi:boot:rsh: successfully launched on n7 (ppc229) n-1<10054> ssi:boot:base:server: expecting connection from finite list n-1<10054> ssi:boot:base:server: got connection from 141.35.13.120 n-1<10054> ssi:boot:base:server: this connection is expected (n7) n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.120:32798 n-1<10054> ssi:boot:base:linear: booting n8 (ppc231) n-1<10054> ssi:boot:rsh: starting lamd on (ppc231) n-1<10054> ssi:boot:rsh: starting on n8 (ppc231): hboot -t -c lam-conf.lamd -d - sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 8 -o 0" n-1<10054> ssi:boot:rsh: launching remotely n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc231 -n 'echo $SHELL' n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc231 -n hboot -t -c lam -conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 
-n 8 -o 0"' n-1<10054> ssi:boot:rsh: successfully launched on n8 (ppc231) n-1<10054> ssi:boot:base:server: expecting connection from finite list n-1<10054> ssi:boot:base:server: got connection from 141.35.13.122 n-1<10054> ssi:boot:base:server: this connection is expected (n8) n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.122:34876 n-1<10054> ssi:boot:base:server: closing server socket n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.107:32790 n-1<10054> ssi:boot:base:server: connected n-1<10054> ssi:boot:base:server: sending number of links (9) n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207) n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211) n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203) n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205) n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228) n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208) n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206) n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229) n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231) n-1<10054> ssi:boot:base:server: finished sending n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.107:32790 n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.111:32784 n-1<10054> ssi:boot:base:server: connected n-1<10054> ssi:boot:base:server: sending number of links (9) n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207) n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211) n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203) n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205) n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228) n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208) n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206) n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229) n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231) n-1<10054> ssi:boot:base:server: finished sending 
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.111:32784 n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.103:32795 n-1<10057> ssi:boot:rsh: finalizing n-1<10057> ssi:boot: Closing n-1<10054> ssi:boot:base:server: connected n-1<10054> ssi:boot:base:server: sending number of links (9) n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207) n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211) n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203) n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205) n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228) n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208) n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206) n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229) n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231) n-1<10054> ssi:boot:base:server: finished sending n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.103:32795 n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.105:32792 n-1<10054> ssi:boot:base:server: connected n-1<10054> ssi:boot:base:server: sending number of links (9) n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207) n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211) n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203) n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205) n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228) n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208) n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206) n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229) n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231) n-1<10054> ssi:boot:base:server: finished sending n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.105:32792 n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.119:32792 n-1<10054> ssi:boot:base:server: connected n-1<10054> ssi:boot:base:server: sending number of links (9) n-1<10054> 
ssi:boot:base:server: sending info: n0 (ppc207) n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211) n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203) n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205) n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228) n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208) n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206) n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229) n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231) n-1<10054> ssi:boot:base:server: finished sending n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.119:32792 n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.108:32793 n-1<10054> ssi:boot:base:server: connected n-1<10054> ssi:boot:base:server: sending number of links (9) n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207) n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211) n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203) n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205) n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228) n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208) n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206) n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229) n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231) n-1<10054> ssi:boot:base:server: finished sending n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.108:32793 n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.106:32788 n-1<10054> ssi:boot:base:server: connected n-1<10054> ssi:boot:base:server: sending number of links (9) n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207) n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211) n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203) n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205) n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228) n-1<10054> ssi:boot:base:server: sending info: n5 
(ppc208) n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206) n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229) n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231) n-1<10054> ssi:boot:base:server: finished sending n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.106:32788 n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.120:54488 n-1<10054> ssi:boot:base:server: connected n-1<10054> ssi:boot:base:server: sending number of links (9) n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207) n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211) n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203) n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205) n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228) n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208) n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206) n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229) n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231) n-1<10054> ssi:boot:base:server: finished sending n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.120:54488 n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.122:56713 n-1<10054> ssi:boot:base:server: connected n-1<10054> ssi:boot:base:server: sending number of links (9) n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207) n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211) n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203) n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205) n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228) n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208) n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206) n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229) n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231) n-1<10054> ssi:boot:base:server: finished sending n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.122:56713 n-1<10054> 
ssi:boot:base:linear: finished n-1<10054> ssi:boot:rsh: all RTE procs started n-1<10054> ssi:boot:rsh: finalizing n-1<10054> ssi:boot: Closing ----------------------------------------------------------------------------- It seems that there is no lamd running on the host ppc207. This indicates that the LAM/MPI runtime environment is not operating. The LAM/MPI runtime environment is necessary for the "lamhalt" command. Please run the "lamboot" command the start the LAM/MPI runtime environment. See the LAM/MPI documentation for how to invoke "lamboot" across multiple machines. ----------------------------------------------------------------------------- startlam script: #!/bin/sh # # # (c) 2002 Sun Microsystems, Inc. Use is subject to license terms. # # preparation of the mpi machine file # # usage: startmpi.sh [options] <pe_hostfile> # # options are: # -catch_hostname # force use of hostname wrapper in $TMPDIR when startingmpirun # -catch_rsh # force use of rsh wrapper in $TMPDIR when starting mpirun # -unique # generate a machinefile where each hostname appears only once # This is needed to setup a multithreaded mpi application # PeHostfile2MachineFile() { cat $1 | while read line; do # echo $line host=`echo $line|cut -f1 -d" "|cut -f1 -d"."` nslots=`echo $line|cut -f2 -d" "` i=1 while [ $i -le $nslots ]; do # add here code to map regular hostnames into ATM hostnames echo $host i=`expr $i + 1` done done } # # startup of LAM conforming with the Grid Engine # Parallel Environment interface # # on success the job will find a machine-file in $TMPDIR/machines # # useful to control parameters passed to us echo $* # parse options catch_rsh=0 catch_hostname=0 unique=0 while [ "$1" != "" ]; do case "$1" in -catch_rsh) catch_rsh=1 ;; -catch_hostname) catch_hostname=1 ;; -unique) unique=1 ;; *) break; ;; esac shift done me=`basename $0` # test number of args if [ $# -ne 1 ]; then echo "$me: got wrong number of arguments" >&2 exit 1 fi # get arguments pe_hostfile=$1 # ensure 
pe_hostfile is readable if [ ! -r $pe_hostfile ]; then echo "$me: can't read $pe_hostfile" >&2 exit 1 fi # create machine-file # remove column with number of slots per queue # mpi does not support them in this form machines="$TMPDIR/machines" if [ $unique = 1 ]; then PeHostfile2MachineFile $pe_hostfile | uniq >> $machines else PeHostfile2MachineFile $pe_hostfile >> $machines fi # trace machines file cat $machines # # Make script wrapper for 'rsh' available in jobs tmp dir # if [ $catch_rsh = 1 ]; then rsh_wrapper=$SGE_ROOT/lam_loose_rsh/rsh if [ ! -x $rsh_wrapper ]; then echo "$me: can't execute $rsh_wrapper" >&2 echo " maybe itresides at a file system not available at this machine" >&2 exit 1 fi rshcmd=rsh case "$ARC" in hp|hp10|hp11|hp11-64) rshcmd=remsh ;; *) ;; esac # note: This could also be done using rcp, ftp or s.th. # else. We use a symbolic link since it is the # cheapest in case of a shared filesystem # ln -s $rsh_wrapper $TMPDIR/$rshcmd fi # # Make script wrapper for 'hostname' available in jobs tmp dir # if [ $catch_hostname = 1 ]; then hostname_wrapper=$SGE_ROOT/lam_loose_rsh/hostname if [ ! -x $hostname_wrapper ]; then echo "$me: can't execute $hostname_wrapper" >&2 echo " maybe itresides at a file system not available at this machine" >&2 exit 1 fi # note: This could also be done using rcp, ftp or s.th. # else. 
We use a symbolic link since it is the # cheapest in case of a shared filesystem # ln -s $hostname_wrapper $TMPDIR/hostname fi # # Extra LAM statement(s) # #if [ -z "`which lamboot 2>/dev/null`" ] ; then # export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH #fi #lamboot -d -ssi boot rsh -ssi rsh_agent "ssh -x" $machines # signal success to caller lamboot -b -d -ssi boot rsh -ssi boot_rsh_agent "ssh -x" $machines echo "lamboot beendet" #signal success to caller exit 0 case PE in SGE: pe_name lam7 slots 999 user_lists NONE xuser_lists NONE start_proc_args /usr/local/grid/sge6.0/mpi/lam_loose_ssh/startlam.sh -unique $pe_hostfile stop_proc_args /usr/local/grid/sge6.0/mpi/lam_loose_ssh/stoplam.sh allocation_rule $round_robin control_slaves TRUE job_is_first_task FALSE urgency_slots min Regards Joerg
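
For anyone reproducing this, the machine-file step of the startlam script can be checked on its own, without SGE or LAM. The following is only a minimal sketch of that step, not the script itself: the function name pe_hostfile_to_machines, the example.org domain and the sample slot counts are made up for illustration, and the input is assumed to follow the usual pe_hostfile layout of "host slots queue processor-range" per line.

```shell
#!/bin/sh
# Hypothetical stand-alone version of the PeHostfile2MachineFile logic:
# print the short hostname once per slot for every line of a pe_hostfile.
pe_hostfile_to_machines() {
    # $1: path to a pe_hostfile (assumed format: host slots queue range)
    while read -r host nslots _rest; do
        short=${host%%.*}          # strip the domain part, keep "ppc207"
        i=1
        while [ "$i" -le "$nslots" ]; do
            echo "$short"
            i=$((i + 1))
        done
    done < "$1"
}

# Sample input; hostnames taken from the log, domain and slots invented.
sample=$(mktemp)
printf '%s\n' \
    'ppc207.example.org 2 all.q@ppc207 UNDEFINED' \
    'ppc211.example.org 1 all.q@ppc211 UNDEFINED' > "$sample"

echo "one line per slot:"
pe_hostfile_to_machines "$sample"

echo "as with -unique (one line per host):"
pe_hostfile_to_machines "$sample" | uniq

rm -f "$sample"
```

With the sample input, the first call prints ppc207 twice and ppc211 once, and the second prints each host once. The second form corresponds to what the PE above requests through "-unique", so the boot schema handed to lamboot should contain each node exactly once.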