I've just built LAM 7.0 on a Red Hat 7.3 system on which we've happily
been using 6.5.9 for months, but I can't get the new build to boot:
lamboot fails with
lamd kernel: problem with bind(): Invalid argument
The output of laminfo and lamboot -d follows. There are no filters or
routing problems, and telnet 172.20.3.57 33577 gives me a "Connection
refused" error, which is what the lamboot help text says to expect.
Thanks,
Jon Bernard
LAM/MPI: 7.0
Prefix: /usr/local/lam/7.0/gnu/ssh
Architecture: i686-pc-linux-gnu
Configured by: root
Configured on: Tue Jul 8 16:07:45 CDT 2003
Configure host: cahaba
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (Module v0.5)
SSI boot: rsh (Module v1.0)
SSI coll: lam_basic (Module v7.0)
SSI coll: smp (Module v1.0)
SSI rpi: crtcp (Module v1.0)
SSI rpi: lamd (Module v7.0)
SSI rpi: sysv (Module v7.0)
SSI rpi: tcp (Module v7.0)
SSI rpi: usysv (Module v7.0)
n0<31675> ssi:boot: Opening
n0<31675> ssi:boot: opening module globus
n0<31675> ssi:boot: initializing module globus
n0<31675> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<31675> ssi:boot: module not available: globus
n0<31675> ssi:boot: opening module rsh
n0<31675> ssi:boot: initializing module rsh
n0<31675> ssi:boot:rsh: module initializing
n0<31675> ssi:boot:rsh:agent: ssh -x
n0<31675> ssi:boot:rsh:username: <same>
n0<31675> ssi:boot:rsh:verbose: 1000
n0<31675> ssi:boot:rsh:algorithm: linear
n0<31675> ssi:boot:rsh:priority: 10
n0<31675> ssi:boot: module available: rsh, priority: 10
n0<31675> ssi:boot: finalizing module globus
n0<31675> ssi:boot:globus: finalizing
n0<31675> ssi:boot: closing module globus
n0<31675> ssi:boot: Selected boot module rsh
n0<31675> ssi:boot:base: looking for boot schema in following directories:
n0<31675> ssi:boot:base: <current directory>
n0<31675> ssi:boot:base: $TROLLIUSHOME/etc
n0<31675> ssi:boot:base: $LAMHOME/etc
n0<31675> ssi:boot:base: /usr/local/lam/7.0/gnu/ssh/etc
n0<31675> ssi:boot:base: looking for boot schema file:
n0<31675> ssi:boot:base: /var/spool/PBS/5.3.2/aux/13446.cahaba.cahaba.eng.uab.edu
n0<31675> ssi:boot:base: found boot schema: /var/spool/PBS/5.3.2/aux/13446.cahaba.cahaba.eng.uab.edu
n0<31675> ssi:boot:rsh: found the following hosts:
n0<31675> ssi:boot:rsh: n0 node57 (cpu=2)
n0<31675> ssi:boot:rsh: n1 node3 (cpu=2)
n0<31675> ssi:boot:rsh: n2 node46 (cpu=2)
n0<31675> ssi:boot:rsh: n3 node47 (cpu=2)
n0<31675> ssi:boot:rsh: n4 node53 (cpu=2)
n0<31675> ssi:boot:rsh: n5 node51 (cpu=2)
n0<31675> ssi:boot:rsh: n6 node48 (cpu=2)
n0<31675> ssi:boot:rsh: n7 node58 (cpu=2)
n0<31675> ssi:boot:rsh: resolved hosts:
n0<31675> ssi:boot:rsh: n0 node57 --> 172.20.3.57 (origin)
n0<31675> ssi:boot:rsh: n1 node3 --> 172.20.3.3
n0<31675> ssi:boot:rsh: n2 node46 --> 172.20.3.46
n0<31675> ssi:boot:rsh: n3 node47 --> 172.20.3.47
n0<31675> ssi:boot:rsh: n4 node53 --> 172.20.3.53
n0<31675> ssi:boot:rsh: n5 node51 --> 172.20.3.51
n0<31675> ssi:boot:rsh: n6 node48 --> 172.20.3.48
n0<31675> ssi:boot:rsh: n7 node58 --> 172.20.3.58
n0<31675> ssi:boot:rsh: starting RTE procs
n0<31675> ssi:boot:base:linear: starting
n0<31675> ssi:boot:base:server: opening server TCP socket
n0<31675> ssi:boot:base:server: opened port 33577
n0<31675> ssi:boot:base:linear: booting n0 (node57)
n0<31675> ssi:boot:rsh: starting lamd on (node57)
n0<31675> ssi:boot:rsh: starting on n0 (node57): hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-13446.cahaba.cahaba.eng.uab.edu -I -H 172.20.3.57 -P 33577 -n 0 -o 0
n0<31675> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to pbs-13446.cahaba.cahaba.eng.uab.edu
tkill: got killname back: /tmp/pbs.13446.cahaba.cahaba.eng.uab.edu/lam-jon_at_node57-pbs-13446.cahaba.cahaba.eng.uab.edu/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/pbs.13446.cahaba.cahaba.eng.uab.edu/lam-jon_at_node57-pbs-13446.cahaba.cahaba.eng.uab.edu/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/pbs.13446.cahaba.cahaba.eng.uab.edu/lam-jon_at_node57-pbs-13446.cahaba.cahaba.eng.uab.edu/lam-io-socket
tkill: f_kill = "/tmp/pbs.13446.cahaba.cahaba.eng.uab.edu/lam-jon_at_node57-pbs-13446.cahaba.cahaba.eng.uab.edu/lam-killfile"
tkill: nothing to kill: "/tmp/pbs.13446.cahaba.cahaba.eng.uab.edu/lam-jon_at_node57-pbs-13446.cahaba.cahaba.eng.uab.edu/lam-killfile"
hboot: performing tkill
hboot: tkill -sessionsuffix pbs-13446.cahaba.cahaba.eng.uab.edu -d
hboot: booting...
hboot: fork /usr/local/lam/7.0/gnu/ssh/bin/lamd
[1] 31679 lamd -H 172.20.3.57 -P 33577 -n 0 -o 0 -d -sessionsuffix pbs-13446.cahaba.cahaba.eng.uab.edu
n0<31675> ssi:boot:rsh: successfully launched on n0 (node57)
n0<31675> ssi:boot:base:server: expecting connection from finite list
lamd kernel: problem with bind(): Invalid argument
n0<31675> ssi:boot:base:server: got connection from 144.155.5.8
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,

        telnet <ipnumber> <portnumber>

   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
n0<31675> ssi:boot:base:server: failed to connect to remote lamd!
n0<31675> ssi:boot:base:server: closing server socket
n0<31675> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
lamboot did NOT complete successfully
LAM 7.0/MPI 2 C++/ROMIO - Indiana University
lamboot: wipe -- nothing to do
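
The telnet check that the lamboot help text describes above can also be
scripted. What follows is a minimal Python sketch and not part of LAM:
the function name `probe` and the 3-second timeout are assumptions made
for illustration. It attempts a TCP connection and reports whether the
result was a connection, a refusal, a timeout, or some other error.
Per the help text, "refused" is the expected answer once the lamboot
agent has closed its port, while a timeout or a routing error would
point at the filter or routing conditions it lists.

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Attempt one TCP connection and classify the outcome as a string.

    Hypothetical helper: mirrors `telnet <ipnumber> <portnumber>` from
    the lamboot help text, nothing more.
    """
    try:
        # create_connection resolves the host and connects with a timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The host was reachable but nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer at all, e.g. packets silently dropped by a filter.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. EHOSTUNREACH or ENETUNREACH.
        return "error: " + errno.errorcode.get(exc.errno, str(exc))


if __name__ == "__main__":
    # Address and port taken from the hboot command line in the log above.
    print(probe("172.20.3.57", 33577))
```

Run from one of the other nodes, this gives the same information as the
manual telnet test without needing an interactive session.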