Hi, I've been fighting with LAM on our cluster recently. lamboot works
fine when run from the command line of the root node or any of the other
nodes. However, MPI programs fail when they go through Sun Grid Engine;
we had the same problem with MPICH. When I use Sun Grid Engine's queues to
manage LAM jobs, lamboot fails with the error message below. There are no
firewalls protecting the nodes as far as I can see; the ipchains files leave
everything wide open. Has anyone had any experience with this?
Thanks,
Jason
(previous initialization messages, then:)
P4_GLOBMEMSIZE=256000000
GEXEC_SVRS=n01 n02 n03 n04 n05 n06 n07 n08 n09 n10 n11 n12 n13 n14 n15 n16 n17 n18 n19 n20 n21 n22 n23 n24 n25 n26 n27 n28 n29 n30 n31 n32 n33 n34 n35 n36 n37 n38 n39 n40 n41 n42 n43 n44
SGE_O_LOGNAME=jmachace
SGE_CELL=default
n0<23068> ssi:boot:rsh: attempting to execute "/opt/lam-mpi_7.0.3/bin/qrsh-lam remote n29.bw01.ibest.uidaho.edu -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-13263-0 -s -I "-H 192.168.0.17 -P 47600 -n 14 -o 0""
n0<23068> ssi:boot:rsh: successfully launched on n14 (n29.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: expecting connection from finite list
n0<23068> ssi:boot:base:server: got connection from 192.168.0.29
n0<23068> ssi:boot:base:server: this connection is expected (n14)
n0<23068> ssi:boot:base:server: remote lamd is at 192.168.0.29:32815
n0<23068> ssi:boot:base:server: closing server socket
n0<23068> ssi:boot:base:server: connecting to lamd at 192.168.0.17:47601
n0<23068> ssi:boot:base:server: connected
n0<23068> ssi:boot:base:server: sending number of links (15)
n0<23068> ssi:boot:base:server: sending info: n0 (n17.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n1 (n14.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n2 (n30.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n3 (n40.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n4 (n16.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n5 (n01.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n6 (n04.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n7 (n27.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n8 (n32.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n9 (n13.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n10 (n43.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n11 (n10.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n12 (n21.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n13 (n18.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: sending info: n14 (n29.bw01.ibest.uidaho.edu)
n0<23068> ssi:boot:base:server: finished sending
n0<23068> ssi:boot:base:server: disconnected from 192.168.0.17:47601
n0<23068> ssi:boot:base:server: connecting to lamd at 192.168.0.14:59725
-----------------------------------------------------------------------------
The lamboot agent failed to open a client socket to the newly-booted
process at IP address 192.168.0.14, port 59725.

Although the newly-booted process has already communicated
successfully with the lamboot agent over other TCP sockets, this is
the first time that the lamboot agent tried to initiate a connection
to the newly-booted process. As such, this may indicate:

1. 192.168.0.14 is not the correct IP address for the machine where
the newly-booted machine was launched
2. There are network filters between the lamboot agent host and
the remote host such that communication on random TCP ports
is blocked
3. Network routing from the local host to the remote isn't
properly configured (this is unlikely)

For number 1, check to ensure that 192.168.0.14 is the correct IP address
for that machine. If it is not, check the host mapping on that machine
(e.g., /etc/hosts) to ensure that 192.168.0.14 is both reachable by
the host where the lamboot agent is running, and is the correct host.

For numbers 2 and 3, try to telnet to 192.168.0.14, port 59725. You should
get a "connection refused" error, which will indicate that you successfully
connected to some machine at that IP address, and no process was
listening on that port. If you get any other kind of error, check
with your system/network administrator -- it may indicate network /
routing issues between the two hosts.
-----------------------------------------------------------------------------
n0<23068> ssi:boot:base:server: connecting to lamd at 192.168.0.30:49476
Warning: You are not root -- using TCP pingscan rather than ICMP
-----------------------------------------------------------------------------
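
The telnet check suggested in the lamboot message can also be scripted so it
is easy to repeat from several nodes. This is only a rough sketch, not
anything shipped with LAM: the function name probe and the 3-second timeout
are my own choices, and the address and port in the comment are simply the
ones from the error text above.

```python
import socket

def probe(host, port, timeout=3.0):
    """Classify one TCP connection attempt to host:port.

    Hypothetical helper, not part of LAM or SGE.
    "open"    -> something accepted the connection
    "refused" -> the host answered, but nothing listens on that port
                 (the outcome the lamboot message says to expect)
    "timeout" -> no answer at all, which hints at filtering or routing
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # e.g. "No route to host" or a name resolution failure
        return "error: %s" % exc

# Example, using the address from the error message:
#   print(probe("192.168.0.14", 59725))
```

Running it from the node where lamboot was started (n17 here) would presumably
be the useful direction, since that is the connection that failed.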